Abstract 

We study the question of testing structured properties (classes) of discrete distributions. 
Specifically, given sample access to an arbitrary distribution $D$ over $[n]$ and a property $\mathcal{P}$, the 
goal is to distinguish between $D \in \mathcal{P}$ and $\ell_1(D, \mathcal{P}) > \varepsilon$. We develop a general algorithm for this 
question, which applies to a large range of “shape-constrained” properties, including monotone, 
log-concave, $t$-modal, piecewise-polynomial, and Poisson Binomial distributions. Moreover, for 
all cases considered, our algorithm has near-optimal sample complexity with regard to the 
domain size and is computationally efficient. For most of these classes, we provide the first 
non-trivial tester in the literature. In addition, we also describe a generic method to prove lower 
bounds for this problem, and use it to show our upper bounds are nearly tight. Finally, we 
extend some of our techniques to tolerant testing, deriving nearly-tight upper and lower bounds 
for the corresponding questions. 


1 Introduction 

Inferring information about the probability distribution that underlies a data sample is an essential 
question in Statistics, and one that has ramifications in every field of the natural sciences and 
quantitative research. In many situations, it is natural to assume that this data exhibits some 
simple structure because of known properties of the origin of the data, and in fact these assumptions 
are crucial in making the problem tractable. Such assumptions translate as constraints on the 
probability distribution - e.g., it is supposed to be Gaussian, or to meet a smoothness or “fat tail” 
condition (see e.g., [Man63, Hou86, TLSM95]). 

As a result, the problem of deciding whether a distribution possesses such a structural property 
has been widely investigated both in theory and practice, in the context of shape restricted 
inference [BDBB72, SS01] and model selection [MP07]. Here, it is guaranteed or thought that the 
unknown distribution satisfies a shape constraint, such as having a monotone or log-concave 
probability density function [SN99, BB05, Wal09, Dia16]. From a different perspective, a recent line of work 
in Theoretical Computer Science, originating from the papers of Batu et al. [BFR+00, BFF+01, 
GR00] has also been tackling similar questions in the setting of property testing (see [Ron08, Ron10, 
Rub12, Can15] for surveys on this field). This very active area has seen a spate of results and 
breakthroughs over the past decade, culminating in very efficient (both sample and time-wise) algorithms 
for a wide range of distribution testing problems [BDKR05, GMV06, AAK+07, DDS+13, CDVV14, 
AD15, DKN15b]. In many cases, this led to a tight characterization of the number of samples 
required for these tasks as well as the development of new tools and techniques, drawing connections 
to learning and information theory [VV10, VV11a, VV14]. 

* Columbia University. Email: ccanonne@cs.columbia.edu. Research supported by NSF CCF-1115703 and NSF 
CCF-1319788. 

† University of Edinburgh. Email: ilias.d@ed.ac.uk. Research supported by EPSRC grant EP/L021749/1, a 
Marie Curie Career Integration Grant, and a SICSA grant. This work was performed in part while visiting CSAIL, 
MIT. 

‡ CSAIL, MIT. Email: tgoule@mit.edu. 

§ CSAIL, MIT and the Blavatnik School of Computer Science, Tel Aviv University. Email: ronitt@csail.mit.edu. 

In this paper, we focus on the following general property testing problem: given a class 
(property) of distributions $\mathcal{P}$ and sample access to an arbitrary distribution $D$, one must distinguish 
between the case that (a) $D \in \mathcal{P}$, versus (b) $\|D - D'\|_1 > \varepsilon$ for all $D' \in \mathcal{P}$ (i.e., $D$ is either in 
the class, or far from it). While many of the previous works have focused on the testing of 
specific properties of distributions or obtained algorithms and lower bounds on a case-by-case basis, an 
emerging trend in distribution testing is to design general frameworks that can be applied to several 
property testing problems [Val11, VV11a, DKN15b, DKN15a]. This direction, the testing analog 
of a similar movement in distribution learning [CDSS13, CDSS14b, CDSS14a, ADLS15], aims at 
abstracting the minimal assumptions that are shared by a large variety of problems, and giving 
algorithms that can be used for any of these problems. In this work, we make significant progress 
in this direction by providing a unified framework for the question of testing various properties of 
probability distributions. More specifically, we describe a generic technique to obtain upper bounds 
on the sample complexity of this question, which applies to a broad range of structured classes. 
Our technique yields sample near-optimal and computationally efficient testers for a wide range 
of distribution families. Conversely, we also develop a general approach to prove lower bounds 
on these sample complexities, and use it to derive tight or nearly tight bounds for many of these 
classes. 

Related work. Batu et al. [BKR04] initiated the study of efficient property testers for 
monotonicity and obtained (nearly) matching upper and lower bounds for this problem; while [AD15] later 
considered testing the class of Poisson Binomial Distributions, and settled the sample complexity of this 
problem (up to the precise dependence on $\varepsilon$). Indyk, Levi, and Rubinfeld [ILR12], focusing on 
distributions that are piecewise constant on $t$ intervals (“$t$-histograms”), described an $O(\sqrt{tn}/\varepsilon^5)$-sample 
algorithm for testing membership to this class. Another body of work by [BDKR05], [BKR04], 
and [DDS+13] shows how assumptions on the shape of the distributions can lead to significantly 
more efficient algorithms. They describe such improvements in the case of identity and closeness 
testing as well as for entropy estimation, under monotonicity or $k$-modality constraints. 
Specifically, Batu et al. show in [BKR04] how to obtain an $O(\log^3 n/\varepsilon^3)$-sample tester for closeness in 
this setting, in stark contrast to the $\Omega(n^{2/3})$ general lower bound. Daskalakis et al. [DDS+13] 
later gave $O(\sqrt{\log n})$ and $O(\log^{2/3} n)$-sample testing algorithms for testing respectively identity 
and closeness of monotone distributions, and obtained similar results for $k$-modal distributions. 
Finally, we briefly mention two related results, due respectively to [BDKR05] and [DDS12a]. The 
first one states that for the task of getting a multiplicative estimate of the entropy of a distribution, 
assuming monotonicity enables exponential savings in sample complexity: $O(\log^6 n)$, instead of 
$\Omega(n^c)$ for the general case. The second describes how to test if an unknown $k$-modal distribution is 
in fact monotone, using only $O(k/\varepsilon^2)$ samples. Note that the latter line of work differs from ours 
in that it presupposes the distributions satisfy some structural property, and uses this knowledge 
to test something else about the distribution; while we are given a priori arbitrary distributions, 
and must check whether the structural property holds. Except for the properties of monotonicity 
and being a PBD, nothing was previously known on testing the shape restricted properties that we 
study. Independently and concurrently to this work, Acharya, Daskalakis, and Kamath obtained a 
sample near-optimal efficient algorithm for testing log-concavity.¹ 

Moreover, for the specific problems of identity and closeness testing,² recent results of [DKN15b, 
DKN15a] describe a general algorithm which applies to a large range of shape or structural 
constraints, and yields optimal identity testers for classes of distributions that satisfy them. We observe 
that while the question they answer can be cast as a specialized instance of membership testing, 
our results are incomparable to theirs, both because of the distinction above (testing with versus 
testing for structure) and as the structural assumptions they rely on are fundamentally different 
from ours. 

1.1 Results and Techniques 

Upper Bounds. A natural way to tackle our membership testing problem would be to first 
learn the unknown distribution D as if it satisfied the property, before checking if the hypothesis 
obtained is indeed both close to the original distribution and to the property. Taking advantage 
of the purported structure, the first step could presumably be conducted with a small number 
of samples; things break down, however, in the second step. Indeed, most approximation results 
leading to the improved learning algorithms one would apply in the first stage only provide very 
weak guarantees, in the $\ell_1$ sense. For this reason, they lack the robustness that would be required 
for the second part, where it becomes necessary to perform tolerant testing between the hypothesis 
and $D$, a task that would then entail a number of samples almost linear in the domain size. To 
overcome this difficulty, we need to move away from these global $\ell_1$ closeness results and instead 
work with stronger requirements, this time in $\ell_2$ norm. 
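One way to see why an $\ell_2$ guarantee at the right scale is the stronger requirement is the standard comparison between the two norms (a textbook consequence of the Cauchy-Schwarz inequality, stated here for illustration only; $I$ denotes the domain and $D'$ an arbitrary second distribution on it):

```latex
\[
  \|D - D'\|_2 \;\le\; \|D - D'\|_1 \;=\; \sum_{i \in I} |D(i) - D'(i)|
  \;\le\; \sqrt{|I|}\,\|D - D'\|_2 .
\]
% Hence \|D - D'\|_2 \le \varepsilon/\sqrt{|I|} implies \|D - D'\|_1 \le \varepsilon,
% whereas \|D - D'\|_1 \le \varepsilon alone only yields \|D - D'\|_2 \le \varepsilon.
```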

At the core of our approach is an idea of Batu et al. [BKR04], who show that monotone 
distributions can be well-approximated (in a certain technical sense) by piecewise constant densities 
on a suitable interval partition of the domain; and leverage this fact to reduce monotonicity testing 
to uniformity testing on each interval of this partition. While the argument of [BKR04] is tailored 
specifically for the setting of monotonicity testing, we are able to abstract the key ingredients, and 
obtain a generic membership tester that applies to a wide range of distribution families. In more 
detail, we provide a testing algorithm which applies to any class of distributions which admits succinct 
approximate decompositions; that is, each distribution in the class can be well-approximated (in 
a strong $\ell_2$ sense) by piecewise constant densities on a small number of intervals (we hereafter refer 
to this approximation property, formally defined in Definition 3.1, as (Succinctness); and extend 
the notation to apply to any class $\mathcal{C}$ of distributions for which all $D \in \mathcal{C}$ satisfy (Succinctness)). 
Crucially, the algorithm does not care about how these decompositions can be obtained: for the 
purpose of testing these structural properties we only need to establish their existence. Specific 
examples are given in the corollaries below. Our main algorithmic result, informally 
stated (see Theorem 3.3 for a detailed formal statement), is as follows: 

Theorem 1.1 (Main Theorem). There exists an algorithm TestSplittable which, given 
sampling access to an unknown distribution $D$ over $[n]$ and parameter $\varepsilon \in (0,1]$, can distinguish with 
probability 2/3 between (a) $D \in \mathcal{P}$ versus (b) $\ell_1(D, \mathcal{P}) > \varepsilon$, for any property $\mathcal{P}$ that satisfies the 
above natural structural criterion (Succinctness). Moreover, for many such properties this algorithm 
is computationally efficient, and its sample complexity is optimal (up to logarithmic factors and the 
exact dependence on $\varepsilon$). 

¹ Following the communication of a preliminary version of this paper (February 2015), we were informed 
that [ADK15] subsequently obtained near-optimal testers for some of the classes we consider. To the best of our 
knowledge, their work builds on ideas from [AD15] and their techniques are orthogonal to ours. 

² Recall that the identity testing problem asks, given the explicit description of a distribution $D^*$ and sample access 
to an unknown distribution $D$, to decide whether $D$ is equal to $D^*$ or far from it; while in closeness testing both 
distributions to compare are unknown. 

We then instantiate this result to obtain “out-of-the-box” computationally efficient testers for 
several classes of distributions, by showing that they satisfy the premise of our theorem (the definition 
of these classes is given in Section 2.1): 

Corollary 1.2. The algorithm TestSplittable can test the classes of monotone, unimodal, 
log-concave, concave, convex, and monotone hazard rate (MHR) distributions, with $O(\sqrt{n}/\varepsilon^{7/2})$ 
samples. 

Corollary 1.3. The algorithm TestSplittable can test the class of $t$-modal distributions, with 
$O(\sqrt{tn}/\varepsilon^{7/2})$ samples. 

Corollary 1.4. The algorithm TestSplittable can test the classes of $t$-histograms and $t$-piecewise 
degree-$d$ distributions, with $O(\sqrt{tn}/\varepsilon^{3})$ and $O(\sqrt{t(d+1)n}/\varepsilon^{7/2} + t(d+1)/\varepsilon^{3})$ samples respectively. 

Corollary 1.5. The algorithm TestSplittable can test the classes of Binomial and Poisson 
Binomial Distributions, with $O(n^{1/4}/\varepsilon^{7/2})$ samples. 


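| Class | Upper bound | Lower bound |
|---|---|---|
| Monotone | $O(\sqrt{n}/\varepsilon^{6})$ [BKR04], $O(\sqrt{n}/\varepsilon^{7/2})$ (Corollary 1.2) | $\Omega(\sqrt{n}/\varepsilon^{2})$ [BKR04], (Corollary 1.6) |
| Unimodal | $O(\sqrt{n}/\varepsilon^{7/2})$ (Corollary 1.2) | $\Omega(\sqrt{n}/\varepsilon^{2})$ (Corollary 1.6) |
| $t$-modal | $O(\sqrt{tn}/\varepsilon^{7/2})$ (Corollary 1.3) | $\Omega(\sqrt{n}/\varepsilon^{2})$ (Corollary 1.6) |
| Log-concave, concave, convex | $O(\sqrt{n}/\varepsilon^{7/2})$ (Corollary 1.2) | $\Omega(\sqrt{n}/\varepsilon^{2})$ (Corollary 1.6) |
| Monotone Hazard Rate (MHR) | $O(\sqrt{n}/\varepsilon^{7/2})$ (Corollary 1.2) | $\Omega(\sqrt{n}/\varepsilon^{2})$ (Corollary 1.6) |
| Binomial, Poisson Binomial (PBD) | $\tilde{O}(n^{1/4}/\varepsilon^{2} + 1/\varepsilon^{6})$ [AD15], $O(n^{1/4}/\varepsilon^{7/2})$ (Corollary 1.5) | $\Omega(n^{1/4}/\varepsilon^{2})$ ([AD15], Corollary 1.7) |
| $t$-histograms | $O(\sqrt{tn}/\varepsilon^{5})$ [ILR12], $O(\sqrt{tn}/\varepsilon^{3})$ (Corollary 1.4) | $\Omega(\sqrt{tn})$ for $t \le 1/\varepsilon$ [ILR12], $\Omega(\sqrt{n}/\varepsilon^{2})$ (Corollary 1.6) |
| $t$-piecewise degree-$d$ | $O(\sqrt{t(d+1)n}/\varepsilon^{7/2} + t(d+1)/\varepsilon^{3})$ (Corollary 1.4) | $\Omega(\sqrt{n}/\varepsilon^{2})$ (Corollary 1.6) |
| $k$-SIIRV | | $\Omega(k^{1/2} n^{1/4})$ (Corollary 1.8) |

Table 1: Summary of results. 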

We remark that the aforementioned sample upper bounds are information-theoretically 
near-optimal in the domain size $n$ (up to logarithmic factors). See Table 1 and the following subsection 
for the corresponding lower bounds. We did not attempt to optimize the dependence on the 
parameter $\varepsilon$, though a more careful analysis can lead to such improvements. 

We stress that prior to our work, no non-trivial testing bound was known for most of these 
classes; specifically, our nearly-tight bounds for $t$-modal with $t > 1$, log-concave, concave, convex, 
MHR, and piecewise polynomial distributions are new. Moreover, although a few of our 
applications were known in the literature (the $O(\sqrt{n}/\varepsilon^{6})$ upper and $\Omega(\sqrt{n}/\varepsilon^{2})$ lower bounds on testing 
monotonicity can be found in [BKR04], while the $\Theta(n^{1/4})$ sample complexity of testing PBDs was 
recently given³ in [AD15], and the task of testing $t$-histograms is considered in [ILR12]), the crux 
here is that we are able to derive them in a unified way, by applying the same generic algorithm to all 
these different distribution families. We note that our upper bound for $t$-histograms (Corollary 1.4) 
also improves on the previous $O(\sqrt{tn}/\varepsilon^{5})$-sample tester, as long as $t = O(n^{1/3}/\varepsilon^{2})$. In addition 
to its generality, our framework yields much cleaner and conceptually simpler proofs of the upper 
and lower bounds from [AD15]. 

Lower Bounds. To complement our upper bounds, we give a generic framework for proving lower 
bounds against testing classes of distributions. In more detail, we describe how to reduce, under 
a mild assumption on the property $\mathcal{C}$, the problem of testing membership to $\mathcal{C}$ (“does $D \in \mathcal{C}$?”) 
to testing identity to $D^*$ (“does $D = D^*$?”), for any explicit distribution $D^*$ in $\mathcal{C}$. While these 
two problems need not in general be related,⁴ we show that our reduction-based approach applies 
to a large number of natural properties, and obtain lower bounds that nearly match our upper 
bounds for all of them. Moreover, this lets us derive a simple proof of the lower bound of [AD15] 
on testing the class of PBDs. The reader is referred to Theorem 6.1 for the formal statement of our 
reduction-based lower bound theorem. In this section, we state the concrete corollaries we obtain 
for specific structured distribution families: 

Corollary 1.6. Testing log-concavity, convexity, concavity, MHR, unimodality, $t$-modality, $t$-histograms, 
and $t$-piecewise degree-$d$ distributions each requires $\Omega(\sqrt{n}/\varepsilon^{2})$ samples (the last three for $t = o(\sqrt{n})$ 
and $t(d+1) = o(\sqrt{n})$, respectively), for any $\varepsilon \ge 1/n^{O(1)}$. 

Corollary 1.7. Testing the classes of Binomial and Poisson Binomial Distributions each requires 
$\Omega(n^{1/4}/\varepsilon^{2})$ samples, for any $\varepsilon \ge 1/n^{O(1)}$. 

Corollary 1.8. There exist absolute constants $c > 0$ and $\varepsilon_0 > 0$ such that testing the class of 
$k$-SIIRV distributions requires $\Omega(k^{1/2} n^{1/4})$ samples, for any $k = o(n^{c})$ and $\varepsilon \le \varepsilon_0$. 

Tolerant Testing. Using our techniques, we also establish nearly-tight upper and lower bounds 
on tolerant testing for shape restrictions. (Tolerant testing of a property $\mathcal{P}$ is defined as follows: given 
$0 \le \varepsilon_1 < \varepsilon_2 \le 1$, one must distinguish between (a) $\ell_1(D, \mathcal{P}) \le \varepsilon_1$ and (b) $\ell_1(D, \mathcal{P}) > \varepsilon_2$. This turns 
out to be, in general, a much harder task than that of “regular” testing, where we take $\varepsilon_1 = 0$.) 
Similarly, our upper and lower bounds are matching as 
a function of the domain size. More specifically, we give a simple generic upper bound approach 
(namely, a learning followed by tolerant testing algorithm). Our tolerant testing lower bounds 
follow the same reduction-based approach as in the non-tolerant case. In more detail, our results 
are as follows (see Section 6 and Section 7): 

Corollary 1.9. Tolerant testing of log-concavity, convexity, concavity, MHR, unimodality, and 
$t$-modality can be performed with […] samples, for $\varepsilon_2 \ge C\varepsilon_1$ (where $C > 2$ is an absolute 
constant). 

Corollary 1.10. Tolerant testing of the classes of Binomial and Poisson Binomial Distributions 
can be performed with […] samples, for $\varepsilon_2 \ge C\varepsilon_1$ (where $C > 2$ is an absolute 
constant). 

Corollary 1.11. Tolerant testing of log-concavity, convexity, concavity, MHR, unimodality, and 
$t$-modality each requires […] samples (the latter for $t = o(n)$). 

Corollary 1.12. Tolerant testing of the classes of Binomial and Poisson Binomial Distributions 
each requires […] samples. 

³ For the sample complexity of testing monotonicity, [BKR04] originally states an $O(\sqrt{n}/\varepsilon^{4})$ upper bound, but the 
proof seems to only result in an $O(\sqrt{n}/\varepsilon^{6})$ bound. Regarding the class of PBDs, [AD15] obtain an 
$n^{1/4} \cdot \tilde{O}(1/\varepsilon^{2}) + \tilde{O}(1/\varepsilon^{6})$ sample complexity, to be compared with our $O(n^{1/4}/\varepsilon^{7/2}) + O(\log^{4} n/\varepsilon^{4})$ upper bound; as well as an 
$\Omega(n^{1/4}/\varepsilon^{2})$ lower bound. 

⁴ As a simple example, consider the class $\mathcal{C}$ of all distributions, for which testing membership is trivial. 

On the scope of our results. We point out that our main theorem is likely to apply to many 
other classes of structured distributions, due to the mild structural assumptions it requires. 
However, we did not attempt here to be comprehensive; but rather to illustrate the generality of our 
approach. Moreover, for all properties considered in this paper the generic upper and lower bounds 
we derive through our methods turn out to be optimal up to at most polylogarithmic factors (with 
regard to the support size). The reader is referred to Table 1 for a summary of our results and 
related work. 


1.2 Organization of the Paper 

We start by giving the necessary background and definitions in Section 2, before turning to our 
main result, the proof of Theorem 1.1 (our general testing algorithm) in Section 3. In Section 4, we 
establish the necessary structural theorems for each class of distributions considered, enabling us 
to derive the upper bounds of Table 1. Section 5 introduces a slight modification of our algorithm 
which yields stronger testing results for classes of distributions with small effective support, and uses 
it to derive Corollary 1.5, our upper bound for Poisson Binomial distributions. Next, Section 6 
contains the details of our lower bound methodology, and of its applications to the classes of Table 1. 
Finally, Section 6.2 is concerned with the extension of this methodology to tolerant testing, of 
which Section 7 describes a generic upper bound counterpart. 


2 Notation and Preliminaries 

2.1 Definitions 

We give here the formal descriptions of the classes of distributions involved in this work. Recall that 
a distribution $D$ over $[n]$ is monotone (non-increasing) if its probability mass function (pmf) satisfies 
$D(1) \ge D(2) \ge \dots \ge D(n)$. A natural generalization of the class $\mathcal{M}$ of monotone distributions is the 
set of $t$-modal distributions, i.e. distributions whose pmf can go “up and down” or “down and up” 
up to $t$ times:⁵ 

Definition 2.1 ($t$-modal). Fix any distribution $D$ over $[n]$, and integer $t$. $D$ is said to have $t$ 
modes if there exists a sequence $i_0 < \dots < i_{t+1}$ such that either $(-1)^{j} D(i_j) < (-1)^{j} D(i_{j+1})$ for all 
$0 \le j \le t$, or $(-1)^{j} D(i_j) > (-1)^{j} D(i_{j+1})$ for all $0 \le j \le t$. We call $D$ $t$-modal if it has at most 
$t$ modes, and write $\mathcal{M}_t$ for the class of all $t$-modal distributions (omitting the dependence on $n$). 
The particular case of $t = 1$ corresponds to the set of unimodal distributions. 
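To make the counting convention of Definition 2.1 concrete, here is a minimal Python sketch. It is one reading of the definition rather than code from the paper, and the helper names `count_modes` and `is_t_modal` are hypothetical. It collapses the pmf into maximal strictly increasing or strictly decreasing runs (ties are skipped); under this reading, a pmf with $r \ge 1$ alternating runs has $r - 1$ modes.

```python
from fractions import Fraction


def count_modes(pmf):
    """Largest t such that the pmf has t modes in the sense of Definition 2.1.

    Counts maximal runs of strictly increasing / strictly decreasing steps
    (ties are skipped); r alternating runs correspond to r - 1 modes.
    """
    runs = 0
    last_sign = 0
    for prev, cur in zip(pmf, pmf[1:]):
        sign = (cur > prev) - (cur < prev)
        if sign != 0 and sign != last_sign:
            runs += 1
            last_sign = sign
    return max(runs - 1, 0)


def is_t_modal(pmf, t):
    """True if the pmf has at most t modes."""
    return count_modes(pmf) <= t


monotone = [Fraction(4, 10), Fraction(3, 10), Fraction(2, 10), Fraction(1, 10)]
single_peak = [Fraction(1, 10), Fraction(4, 10), Fraction(3, 10), Fraction(2, 10)]
two_peaks = [Fraction(1, 10), Fraction(3, 10), Fraction(1, 10),
             Fraction(3, 10), Fraction(2, 10)]
```

With this convention a monotone pmf has 0 modes, a single-peaked pmf has 1, and a pmf with two peaks has 3, in line with the remark that a "bimodal" distribution in the Statistics sense is 3-modal here.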

Definition 2.2 (Log-Concave). A distribution $D$ over $[n]$ is said to be log-concave if it satisfies the 
following conditions: (i) for any $1 \le i < j < k \le n$ such that $D(i)D(k) > 0$, $D(j) > 0$; and (ii) for 
all $1 < k < n$, $D(k)^2 \ge D(k-1)D(k+1)$. We write $\mathcal{L}$ for the class of all log-concave distributions 
(omitting the dependence on $n$). 

Definition 2.3 (Concave and Convex). A distribution $D$ over $[n]$ is said to be concave if it satisfies 
the following conditions: (i) for any $1 \le i < j < k \le n$ such that $D(i)D(k) > 0$, $D(j) > 0$; and (ii) 
for all $1 < k < n$ such that $D(k-1)D(k+1) > 0$, $2D(k) \ge D(k-1) + D(k+1)$; it is convex if the 
reverse inequality holds in (ii). We write $\mathcal{K}^-$ (resp. $\mathcal{K}^+$) for the class of all concave (resp. convex) 
distributions (omitting the dependence on $n$). 


It is not hard to see that convex and concave distributions are unimodal; moreover, every concave 
distribution is also log-concave, i.e. $\mathcal{K}^- \subseteq \mathcal{L}$. Note that in both Definition 2.2 and Definition 2.3, 
condition (i) is equivalent to enforcing that the distribution be supported on an interval. 
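The two definitions translate directly into membership checks on an explicitly given pmf. The following Python sketch is illustrative only (the function names are hypothetical, and the pmf is stored as a 0-indexed list, so index `k` stands for the domain element $k + 1$); exact rational arithmetic avoids rounding issues in the inequalities.

```python
from fractions import Fraction
from math import comb


def supported_on_interval(pmf):
    """Condition (i): the indices with positive mass form a contiguous block."""
    support = [i for i, p in enumerate(pmf) if p > 0]
    return not support or support[-1] - support[0] + 1 == len(support)


def is_log_concave(pmf):
    """Conditions (i) and (ii) of Definition 2.2."""
    interior = range(1, len(pmf) - 1)
    return supported_on_interval(pmf) and all(
        pmf[k] ** 2 >= pmf[k - 1] * pmf[k + 1] for k in interior
    )


def is_concave(pmf):
    """Conditions (i) and (ii) of Definition 2.3."""
    interior = range(1, len(pmf) - 1)
    return supported_on_interval(pmf) and all(
        2 * pmf[k] >= pmf[k - 1] + pmf[k + 1]
        for k in interior
        if pmf[k - 1] * pmf[k + 1] > 0
    )


binomial = [Fraction(comb(4, i), 16) for i in range(5)]   # Bin(4, 1/2)
tent = [Fraction(c, 9) for c in (1, 2, 3, 2, 1)]
geometric_like = [Fraction(c, 15) for c in (8, 4, 2, 1)]
gap = [Fraction(1, 2), Fraction(0), Fraction(1, 2)]       # support is not an interval
```

Here `tent` is both concave and log-concave, `geometric_like` is log-concave but not concave, and `gap` fails condition (i) in both definitions.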


Definition 2.4 (Monotone Hazard Rate). A distribution $D$ over $[n]$ is said to have monotone 
hazard rate (MHR) if its hazard rate $H(i) := D(i) / \sum_{j=i}^{n} D(j)$ is a non-decreasing function. We write 
$\mathcal{MHR}$ for the class of all MHR distributions (omitting the dependence on $n$). 


It is known that every log-concave distribution is both unimodal and MHR (see e.g. [An96, 
Proposition 10]), and that monotone distributions are MHR. Two other classes of distributions 
have elicited significant interest in the context of density estimation, that of histograms (piecewise 
constant) and piecewise polynomial densities: 


Definition 2.5 (Piecewise Polynomials [CDSS14a]). A distribution $D$ over $[n]$ is said to be a 
$t$-piecewise degree-$d$ distribution if there is a partition of $[n]$ into $t$ disjoint intervals $I_1, \dots, I_t$ such 
that $D(i) = p_j(i)$ for all $i \in I_j$, where each of $p_1, \dots, p_t$ is a univariate polynomial of degree at most 
$d$. We write $\mathcal{P}_{t,d}$ for the class of all $t$-piecewise degree-$d$ distributions (omitting the dependence on 
$n$). (We note that $t$-piecewise degree-0 distributions are also commonly referred to as $t$-histograms, 
and write $\mathcal{H}_t$ for $\mathcal{P}_{t,0}$.) 

Finally, we recall the definition of the two following classes, which both extend the family of 
Binomial distributions $\mathcal{BIN}_n$: the first, by removing the need for each of the independent Bernoulli 
summands to share the same bias parameter. 

Definition 2.6. A random variable $X$ is said to follow a Poisson Binomial Distribution (with 
parameter $n \in \mathbb{N}$) if it can be written as $X = \sum_{k=1}^{n} X_k$, where $X_1, \dots, X_n$ are independent, 
non-necessarily identically distributed Bernoulli random variables. We denote by $\mathcal{PBD}_n$ the class of all 
such Poisson Binomial Distributions. 

⁵ Note that this slightly deviates from the Statistics literature, where only the peaks are counted as modes (so that 
what is usually referred to as a bimodal distribution is, according to our definition, 3-modal). 

It is not hard to show that Poisson Binomial Distributions are in particular log-concave. One can 
generalize even further, by allowing each random variable of the summation to be integer-valued: 

Definition 2.7. Fix any $k > 0$. We say a random variable $X$ is a $k$-Sum of Independent Integer 
Random Variables ($k$-SIIRV) with parameter $n \in \mathbb{N}$ if it can be written as $X = \sum_{j=1}^{n} X_j$, where 
$X_1, \dots, X_n$ are independent, non-necessarily identically distributed random variables taking value 
in $\{0, 1, \dots, k-1\}$. We denote by $k$-$\mathcal{SIIRV}_n$ the class of all such $k$-SIIRVs. 
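Since both classes are sums of independent summands, their pmfs can be computed by repeated convolution. The sketch below is an illustration under that observation, not a procedure from the paper; the names `convolve`, `siirv_pmf`, and `pbd_pmf` are hypothetical, and the resulting pmf is indexed by the value of the sum, starting at 0.

```python
from fractions import Fraction
from math import comb


def convolve(p, q):
    """pmf of X + Y for independent X ~ p and Y ~ q (supported on 0, 1, 2, ...)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def siirv_pmf(summands):
    """pmf of a sum of independent integer random variables, each given by its pmf."""
    total = [Fraction(1)]  # point mass at 0
    for pmf in summands:
        total = convolve(total, pmf)
    return total


def pbd_pmf(biases):
    """Poisson Binomial pmf: the special case of Bernoulli summands (k = 2)."""
    return siirv_pmf([[1 - b, b] for b in biases])


half = Fraction(1, 2)
binomial = pbd_pmf([half] * 4)                               # equal biases
pbd = pbd_pmf([Fraction(1, 10), half, Fraction(9, 10)])      # unequal biases
```

With equal biases the Poisson Binomial pmf reduces to the Binomial one, which is the sense in which the class extends the Binomial family.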


2.2 Tools from previous work 

We first restate a result of Batu et al. relating closeness to uniformity in $\ell_2$ and $\ell_1$ norms to “overall 
flatness” of the probability mass function, and which will be one of the ingredients of the proof of 
Theorem 1.1: 

Lemma 2.8 ([BFR+00, BFF+01]). Let $D$ be a distribution on a domain $S$. (a) If $\max_{i \in S} D(i) \le 
(1+\varepsilon) \min_{i \in S} D(i)$, then $\|D\|_2^2 \le (1+\varepsilon^2)/|S|$. (b) If $\|D\|_2^2 \le (1+\varepsilon^2)/|S|$, then $\|D - U_S\|_1 \le \varepsilon$. 
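As a sanity check of the two implications, the following Python sketch evaluates both sides of Lemma 2.8 on a small example. The helper names and the specific pmf are chosen for illustration only; exact rationals are used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction


def l2_norm_sq(pmf):
    """Squared l2 norm of the pmf."""
    return sum(p * p for p in pmf)


def l1_to_uniform(pmf):
    """l1 distance between the pmf and the uniform distribution on its domain."""
    u = Fraction(1, len(pmf))
    return sum(abs(p - u) for p in pmf)


def is_flat(pmf, eps):
    """Premise of part (a): max D(i) <= (1 + eps) * min D(i)."""
    return max(pmf) <= (1 + eps) * min(pmf)


eps = Fraction(1, 4)
pmf = [Fraction(5, 18), Fraction(5, 18), Fraction(4, 18), Fraction(4, 18)]
bound = (1 + eps ** 2) / len(pmf)   # (1 + eps^2) / |S|
```

On this example the premise of (a) holds with equality, the squared norm 82/324 is below the bound 17/64, and the $\ell_1$ distance to uniform is 1/9, below $\varepsilon = 1/4$.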

To check condition (b) above we shall rely on the following, which one can derive from the techniques 
in [DKN15b] and whose proof we defer to Appendix A: 

Lemma 2.9 (Adapted from [DKN15b, Theorem 11]). There exists an algorithm Check-Small-$\ell_2$ 
which, given parameters $\varepsilon, \delta \in (0,1)$ and $c \cdot \sqrt{|I|}/\varepsilon^{2} \cdot \log(1/\delta)$ independent samples from a distribution 
$D$ over $I$ (for some absolute constant $c > 0$), outputs either yes or no, and satisfies the following. 

• If $\|D - U_I\|_2 > \varepsilon/\sqrt{|I|}$, then the algorithm outputs no with probability at least $1 - \delta$; 

• If $\|D - U_I\|_2 \le \varepsilon/(2\sqrt{|I|})$, then the algorithm outputs yes with probability at least $1 - \delta$. 
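The quantity that Lemma 2.9 thresholds can be estimated from samples with the classical collision statistic, using the identity $\|D - U_I\|_2^2 = \|D\|_2^2 - 1/|I|$. The sketch below shows only that textbook estimator; it is not the Check-Small-$\ell_2$ algorithm of the lemma (which is derived from [DKN15b]), and the function name is hypothetical.

```python
from collections import Counter


def estimate_l2_sq_to_uniform(samples, domain_size):
    """Collision-based estimate of ||D - U_I||_2^2 = ||D||_2^2 - 1/|I|.

    The fraction of colliding sample pairs is an unbiased estimate of ||D||_2^2.
    """
    m = len(samples)
    if m < 2:
        raise ValueError("need at least two samples")
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    return collisions / (m * (m - 1) / 2) - 1 / domain_size
```

Note that the estimate can be negative on small samples, since it subtracts the exact value $1/|I|$ from a noisy estimate of $\|D\|_2^2$.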

Finally, we will also rely on a classical result from Probability, the Dvoretzky-Kiefer-Wolfowitz 
(DKW) inequality, restated below: 

Theorem 2.10 ([DKW56, Mas90]). Let $D$ be a distribution over $[n]$. Given $m$ independent samples 
$x_1, \dots, x_m$ from $D$, define the empirical distribution $\hat{D}$ as follows: 

$$\hat{D}(i) \stackrel{\rm def}{=} \frac{|\{ j \in [m] : x_j = i \}|}{m}, \qquad i \in [n].$$

Then, for all $\varepsilon > 0$, $\Pr\big[ \|D - \hat{D}\|_{\rm Kol} > \varepsilon \big] \le 2e^{-2m\varepsilon^{2}}$, where $\|\cdot - \cdot\|_{\rm Kol}$ denotes the Kolmogorov 
distance (i.e., the $\ell_\infty$ distance between cumulative distribution functions). 

In particular, this implies that $O(1/\varepsilon^{2})$ samples suffice to learn a distribution up to $\varepsilon$ in Kolmogorov 
distance. 
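The DKW bound can be turned into an explicit sample size by solving $2e^{-2m\varepsilon^2} \le \delta$ for $m$. The following Python sketch does this and also computes the empirical distribution and the Kolmogorov distance as defined above; the function names and the failure probability parameter `delta` are introduced here for illustration.

```python
import math
from itertools import accumulate


def dkw_sample_size(eps, delta):
    """Smallest m with 2 * exp(-2 * m * eps**2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


def empirical_pmf(samples, n):
    """Empirical distribution over {1, ..., n}, returned as a 0-indexed list."""
    counts = [0] * n
    for x in samples:
        counts[x - 1] += 1
    return [c / len(samples) for c in counts]


def kolmogorov_distance(p, q):
    """l_infinity distance between the two cumulative distribution functions."""
    return max(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))
```

For instance, `dkw_sample_size(0.1, 0.05)` asks how many samples make the empirical distribution 0.1-close in Kolmogorov distance with probability at least 0.95.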





3 The General Algorithm 

In this section, we obtain our main result, restated below: 

Theorem 1.1 (Main Theorem). There exists an algorithm TestSplittable which, given 
sampling access to an unknown distribution $D$ over $[n]$ and parameter $\varepsilon \in (0,1]$, can distinguish with 
probability 2/3 between (a) $D \in \mathcal{P}$ versus (b) $\ell_1(D, \mathcal{P}) > \varepsilon$, for any property $\mathcal{P}$ that satisfies the 
above natural structural criterion (Succinctness). Moreover, for many such properties this algorithm 
is computationally efficient, and its sample complexity is optimal (up to logarithmic factors and the 
exact dependence on $\varepsilon$). 

Intuition. Before diving into the proof of this theorem, we first provide a high-level description 
of the argument. The algorithm proceeds in 3 stages: the first, the decomposition step, attempts 
to recursively construct a partition of the domain in a small number of intervals, with a very 
strong guarantee. If the decomposition succeeds, then the unknown distribution $D$ will be close 
(in $\ell_1$ distance) to its “flattening” on the partition; while if it fails (too many intervals have to be 
created), this serves as evidence that $D$ does not belong to the class and we can reject. The second 
stage, the approximation step, then learns this flattening of the distribution, which can be done 
with few samples since by construction we do not have many intervals. The last stage is purely 
computational, the projection step: where we verify that the flattening we have learned is indeed 
close to the class $\mathcal{C}$. If all three stages succeed, then by the triangle inequality it must be the case 
that $D$ is close to $\mathcal{C}$; and by the structural assumption on the class, if $D \in \mathcal{C}$ then it will admit 
succinct enough partitions, and all three stages will go through. 
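The “flattening” used in the first two stages can be pictured as follows. This Python sketch assumes the usual meaning of the term, namely that the pmf is replaced on each interval of the partition by its average over that interval, so that interval masses are preserved; the function names and the example are illustrative, not taken from the paper.

```python
from fractions import Fraction


def flatten(pmf, partition):
    """Replace the pmf on each interval by its average over that interval.

    `partition` lists 1-indexed inclusive intervals (lo, hi) covering 1..len(pmf).
    """
    flat = [Fraction(0)] * len(pmf)
    for lo, hi in partition:
        mass = sum(pmf[lo - 1:hi])
        for i in range(lo - 1, hi):
            flat[i] = mass / (hi - lo + 1)
    return flat


def l1_distance(p, q):
    """l1 distance between two pmfs on the same domain."""
    return sum(abs(a - b) for a, b in zip(p, q))


pmf = [Fraction(4, 10), Fraction(3, 10), Fraction(2, 10), Fraction(1, 10)]
flat = flatten(pmf, [(1, 2), (3, 4)])
```

The result is a histogram with one piece per interval, which is why it can be learned with a number of samples that depends on the number of intervals rather than on the domain size.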

Turning to the proof, we start by defining formally the “structural criterion” we shall rely on, before 
describing the algorithm at the heart of our result in Section 3.1. (We note that a modification of 
this algorithm will be described in Section 5, and will allow us to derive Corollary 1.5.) 

Definition 3.1 (Decompositions). Let $\gamma > 0$ and $L = L(\gamma, n) \ge 1$. A class of distributions $\mathcal{C}$ 
on $[n]$ is said to be $(\gamma, L)$-decomposable if for every $D \in \mathcal{C}$ there exists $\ell \le L$ and a partition 
$\mathcal{I}(\gamma, D) = (I_1, \dots, I_\ell)$ of the interval $[1, n]$ such that, for all $j \in [\ell]$, one of the following holds: 

(i) $D(I_j) \le \gamma/L$; or 

(ii) $\max_{i \in I_j} D(i) \le (1+\gamma) \cdot \min_{i \in I_j} D(i)$. 

Further, if $\mathcal{I}(\gamma, D)$ is dyadic (i.e., each $I_k$ is of the form $[j \cdot 2^{i} + 1, (j+1) \cdot 2^{i}]$ for some integers $i, j$, 
corresponding to the leaves of a recursive bisection of $[n]$), then $\mathcal{C}$ is said to be $(\gamma, L)$-splittable. 
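For an explicitly given pmf and candidate partition, the two conditions of Definition 3.1 can be checked mechanically. The Python sketch below is illustrative: the function name is hypothetical, it assumes that the threshold in condition (i) is $\gamma/L$, and it does not verify that the intervals actually cover $[1, n]$.

```python
from fractions import Fraction


def is_valid_decomposition(pmf, partition, gamma, L):
    """Check conditions (i)/(ii) of Definition 3.1 for a given interval partition.

    `partition` lists 1-indexed inclusive intervals (lo, hi).
    ASSUMPTION: condition (i) uses the threshold gamma / L.
    """
    if len(partition) > L:
        return False
    for lo, hi in partition:
        block = pmf[lo - 1:hi]
        light = sum(block) <= gamma / L                  # condition (i)
        flat = max(block) <= (1 + gamma) * min(block)    # condition (ii)
        if not (light or flat):
            return False
    return True


pmf = [Fraction(4, 10), Fraction(3, 10), Fraction(2, 10), Fraction(1, 10)]
gamma = Fraction(1, 2)
```

On this example the partition `[(1, 2), (3, 4)]` fails because the second interval is neither light nor flat, while splitting it into singletons succeeds as long as three intervals are allowed.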

Lemma 3.2. If $\mathcal{C}$ is $(\gamma, L)$-decomposable, then it is $(\gamma, O(L \log n))$-splittable. 

Proof. We begin by proving the claim that for every partition $\mathcal{I} = \{I_1, I_2, \dots, I_L\}$ of the interval 
$[1, n]$ into $L$ intervals, there exists a refinement of that partition which consists of at most $O(L \log n)$ 
dyadic intervals. For this, it suffices to prove that every interval $[a, b] \subseteq [1, n]$ can be partitioned in at 
most $O(\log n)$ dyadic intervals. Indeed, let $\ell$ be the largest integer such that $2^{\ell} \le \frac{b-a}{2}$ and let $m$ be 
the smallest integer such that $m \cdot 2^{\ell} \ge a$. It follows that $m \cdot 2^{\ell} \le a + \frac{b-a}{2}$ and $(m+1) \cdot 2^{\ell} \le b$. 

So, the interval $I = [m \cdot 2^{\ell} + 1, (m+1) \cdot 2^{\ell}]$ is fully contained in $[a, b]$ and has size at least $\frac{b-a}{4}$. 
We will also use the fact that, for every $\ell' \le \ell$, 

$$m \cdot 2^{\ell} = m \cdot 2^{\ell - \ell'} \cdot 2^{\ell'} = m' \cdot 2^{\ell'}. \qquad (1)$$

Now consider the following procedure: starting from the right (resp. left) side of the interval $I$, 
we add the largest dyadic interval which is adjacent to it and fully contained in $[a, b]$, and recurse until we 
cover the whole interval $[(m+1) \cdot 2^{\ell} + 1, b]$ (resp. $[a, m \cdot 2^{\ell}]$). Clearly, at the end of this procedure, 
the whole interval $[a, b]$ is covered by dyadic intervals. It remains to show that the procedure takes 
$O(\log n)$ steps. Indeed, using Equation (1), we can see that at least half of the remaining left or right 
interval is covered in each step (except maybe for the first 2 steps where it is at least a quarter). 
Thus, the procedure will take at most $2\log n + 2 = O(\log n)$ steps in total. From the above, we can 
see that each of the $L$ intervals of the partition $\mathcal{I}$ can be covered with $O(\log n)$ dyadic intervals, 
which completes the proof of the claim. 

In order to complete the proof of the lemma, notice that the two conditions in Definition 3.1 
are closed under taking subsets. □ 
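The claim that any interval splits into few dyadic intervals can also be checked computationally. The Python sketch below uses a left-to-right greedy variant rather than the middle-out procedure of the proof: at each position it takes the largest aligned power-of-two block that still fits. The function name is hypothetical; dyadic intervals are of the form $[j \cdot 2^{i} + 1, (j+1) \cdot 2^{i}]$ as in Definition 3.1.

```python
def dyadic_cover(a, b):
    """Partition [a, b] (1-indexed, inclusive) into dyadic intervals.

    Greedy left-to-right variant: at each step take the largest block
    [lo + 1, lo + size] with size a power of two dividing lo that fits in [a, b].
    """
    if a < 1 or b < a:
        raise ValueError("expected 1 <= a <= b")
    pieces = []
    lo = a - 1  # everything up to and including position lo is already covered
    while lo < b:
        # Largest power of two dividing lo (any power of two if lo == 0).
        size = lo & -lo if lo > 0 else 1 << b.bit_length()
        while lo + size > b:
            size >>= 1
        pieces.append((lo + 1, lo + size))
        lo += size
    return pieces
```

For example, `dyadic_cover(3, 10)` returns `[(3, 4), (5, 8), (9, 10)]`. Block sizes first grow and then shrink, so each power of two is used at most twice, which gives the logarithmic bound.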

3.1 The algorithm 

Theorem 1.1, and with it Corollary 1.2 and Corollary 1.3, will follow from the theorem below, 
combined with the structural theorems from Section 4: 

Theorem 3.3. Let $\mathcal{C}$ be a class of distributions over $[n]$ for which the following holds. 

1. $\mathcal{C}$ is $(\gamma, L(\gamma, n))$-splittable; 

2. there exists a procedure ProjectionDist$_{\mathcal{C}}$ which, given as input a parameter $\alpha \in (0,1)$ and 
the explicit description of a distribution $D$ over $[n]$, returns yes if the distance $\ell_1(D, \mathcal{C})$ to $\mathcal{C}$ 
is at most $\alpha/10$, and no if $\ell_1(D, \mathcal{C}) > 9\alpha/10$ (and either yes or no otherwise). 

Then, the algorithm TestSplittable (Algorithm 1) is an $O(\max(\sqrt{nL}\log n/\varepsilon^{3}, L/\varepsilon^{2}))$-sample 
tester for $\mathcal{C}$, for $L = L(\varepsilon, n)$. (Moreover, if ProjectionDist$_{\mathcal{C}}$ is computationally efficient, then 
so is TestSplittable.) 


3.2 Proof of Theorem 3.3 

We now give the proof of our main result (Theorem 3.3), first analyzing the sample complexity 
of Algorithm 1 before arguing its correctness. For the latter, we will need the following simple 
lemma from [ILR12], restated below: 

Fact 3.4 ([ILR12, Fact 1]). Let $D$ be a distribution over $[n]$, and $\delta \in (0,1]$. Given $m \ge C \cdot$ […] 
independent samples from $D$ (for some absolute constant $C > 0$), with probability at least $1 - \delta$ we 
have that, for every interval $I \subseteq [n]$: 

(i) if $D(I) \ge$ […], then […] $\le \frac{m_I}{m} \le$ […]; 

(ii) if $\frac{m_I}{m} \ge$ […], then $D(I) >$ […]; 

(iii) if $\frac{m_I}{m} <$ […], then $D(I) <$ […]; 

where $m_I = |\{ j \in [m] : x_j \in I \}|$ is the number of the samples falling into $I$. 


3.3 Sample complexity. 


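The sample complexity is immediate, and comes from Steps 4 and 20. The total number of samples
is

$$m + O\left(\frac{\ell}{\varepsilon^2}\right) = O\left(\frac{\sqrt{|I| \cdot L}}{\varepsilon^3}\log|I| + \frac{L}{\varepsilon}\log|I| + \frac{L}{\varepsilon^2}\right) = O\left(\frac{\sqrt{|I| \cdot L}}{\varepsilon^3}\log|I| + \frac{L}{\varepsilon^2}\right).$$
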
Algorithm 1 TestSplittable

Require: Domain $I$ (interval), sample access to $D$ over $I$; subroutine ProjectionDist$_{\mathcal{C}}$
Input: Parameters $\varepsilon$ and function $L_{\mathcal{C}}(\cdot, \cdot)$.
 1: Setting Up
 2: Define $\gamma := \frac{\varepsilon}{80}$, $L := L_{\mathcal{C}}(\gamma, |I|)$, $\kappa := \frac{\varepsilon}{160L}$, $\delta := \frac{1}{10L}$; and let $c > 0$ be as in Lemma 2.9.
 3: Set $m := C \cdot \max\left(\frac{1}{\kappa}, \frac{\sqrt{L|I|}}{\varepsilon^3}\right) \cdot \log|I| = \tilde{O}\left(\frac{\sqrt{L|I|}}{\varepsilon^3} + \frac{L}{\varepsilon}\right)$. ▷ $C$ is an absolute constant.
 4: Obtain a sequence $s$ of $m$ independent samples from $D$. ▷ For any $J \subseteq I$, let $m_J$ be the number of samples falling in $J$.
 5:
 6: Decomposition
 7: while $m_I \ge \max\left(c \cdot \frac{\sqrt{|I|}}{\varepsilon^2}\log\frac{1}{\delta}, \kappa m\right)$ and at most $L$ splits have been performed do
 8:     Run Check-Small-$\ell_2$ (from Lemma 2.9) with parameters $\frac{\varepsilon}{40}$ and $\delta$, using the samples of $s$ belonging to $I$.
 9:     if Check-Small-$\ell_2$ outputs no then
10:         Bisect $I$, and recurse on both halves (using the same samples).
11:     end if
12: end while
13: if more than $L$ splits have been performed then
14:     return REJECT
15: else
16:     Let $\mathcal{I} := (I_1, \dots, I_\ell)$ be the partition of $[n]$ from the leaves of the recursion. ▷ $\ell \le L$.
17: end if
18:
19: Approximation
20: Learn the flattening $\Phi(D, \mathcal{I})$ of $D$ to $\ell_1$ error $\frac{\varepsilon}{20}$ (with failure probability at most 1/10), using $O(\ell/\varepsilon^2)$ new samples. Let $\tilde{D}$ be the resulting hypothesis. ▷ $\tilde{D}$ is an $\ell$-histogram.
21:
22: Offline Check
23: return ACCEPT if and only if ProjectionDist$_{\mathcal{C}}(\varepsilon, \tilde{D})$ returns yes. ▷ No sample needed.

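As a rough illustration of the control flow of the Decomposition stage (Steps 7 to 17), the following sketch runs the recursive bisection on a fixed list of samples. It is not the algorithm itself: the predicate `is_flat` is a placeholder for Check-Small-$\ell_2$ from Lemma 2.9, the function `toy_is_flat` is a purely hypothetical stand-in used to make the example runnable, and `min_count` abstracts the threshold of Step 7.

```python
from collections import Counter

def toy_is_flat(inside, lo, hi):
    """Hypothetical stand-in for Check-Small-l2: accept when the empirical
    counts of the points of [lo, hi] are within a factor 2 of each other."""
    counts = Counter(inside)
    per_point = [counts[x] for x in range(lo, hi + 1)]
    return max(per_point) <= 2 * min(per_point)

def decompose(samples, n, is_flat, min_count, max_splits):
    """Bisect [1, n] until every piece is light (too few samples), a single
    point, or accepted by is_flat. Returns the sorted list of leaves, or
    None when more than max_splits bisections were needed (REJECT)."""
    leaves, splits = [], 0
    stack = [(1, n)]
    while stack:
        lo, hi = stack.pop()
        inside = [x for x in samples if lo <= x <= hi]
        light = len(inside) < min_count(hi - lo + 1)
        if light or lo == hi or is_flat(inside, lo, hi):
            leaves.append((lo, hi))
            continue
        splits += 1
        if splits > max_splits:
            return None
        mid = (lo + hi) // 2
        stack.append((mid + 1, hi))
        stack.append((lo, mid))
    return sorted(leaves)
```

The same samples are reused at every level of the recursion, as in Step 10; only the Approximation stage draws fresh ones.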


3.4 Correctness. 

Say an interval $I$ considered during the execution of the “Decomposition” step is heavy if $m_I$ is big
enough on Step 7, and light otherwise; and let $\mathcal{H}$ and $\mathcal{L}$ denote the sets of heavy and light intervals
respectively. By choice of $m$ and a union bound over all $|I|^2$ possible intervals, we can assume
that with probability at least 9/10 the guarantees of Fact 3.4 hold simultaneously for all
intervals considered. We hereafter condition on this event.

We first argue that if the algorithm does not reject in Step 13, then with probability at least
9/10 we have $\|D - \Phi(D, \mathcal{I})\|_1 \le \varepsilon/20$. Indeed, we can write

$$\|D - \Phi(D, \mathcal{I})\|_1 = \sum_{k:\, I_k \in \mathcal{L}} D(I_k) \cdot \|D_{I_k} - U_{I_k}\|_1 + \sum_{k:\, I_k \in \mathcal{H}} D(I_k) \cdot \|D_{I_k} - U_{I_k}\|_1 \le 2\sum_{k:\, I_k \in \mathcal{L}} D(I_k) + \sum_{k:\, I_k \in \mathcal{H}} D(I_k) \cdot \|D_{I_k} - U_{I_k}\|_1.$$



Let us bound the two terms separately. 

• If $I' \in \mathcal{H}$, then by our choice of threshold we can apply Lemma 2.9 with $\delta = \frac{1}{10L}$; conditioning
on all of the (at most $L$) events happening, which overall fails with probability at most 1/10
by a union bound, we get

$$\|D_{I'}\|_2^2 = \|D_{I'} - U_{I'}\|_2^2 + \frac{1}{|I'|} \le \left(1 + \frac{\varepsilon^2}{1600}\right)\frac{1}{|I'|}$$

as Check-Small-$\ell_2$ returned yes; and by Lemma 2.8 this implies $\|D_{I'} - U_{I'}\|_1 \le \varepsilon/40$.

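• If $I' \in \mathcal{L}$, then we claim that $D(I') \le \max\left(\kappa, 2c \cdot \frac{\sqrt{|I'|}}{m\varepsilon^2}\log\frac{1}{\delta}\right)$. Clearly, this is true if $D(I') \le \kappa$,
so it only remains to show that $D(I') \le 2c \cdot \frac{\sqrt{|I'|}}{m\varepsilon^2}\log\frac{1}{\delta}$. But this follows from Fact 3.4 (i), as
if we had $D(I') > 2c \cdot \frac{\sqrt{|I'|}}{m\varepsilon^2}\log\frac{1}{\delta}$ then $m_{I'}$ would have been big enough, and $I' \notin \mathcal{L}$. Overall,

$$\sum_{I' \in \mathcal{L}} D(I') \le \sum_{I' \in \mathcal{L}} \left(\kappa + 2c \cdot \frac{\sqrt{|I'|}}{m\varepsilon^2}\log\frac{1}{\delta}\right) \le L\kappa + 2\sum_{I' \in \mathcal{L}} c \cdot \frac{\sqrt{|I'|}}{m\varepsilon^2}\log\frac{1}{\delta} \le \frac{\varepsilon}{160}\left(1 + \sum_{I' \in \mathcal{L}} \sqrt{\frac{|I'|}{|I| L}}\right) \le \frac{\varepsilon}{80}$$

for a sufficiently big choice of constant $C > 0$ in the definition of $m$; where we first used that
$|\mathcal{L}| \le L$, and then that $\sum_{I' \in \mathcal{L}} \sqrt{\frac{|I'|}{|I|}} \le \sqrt{L}$ by Jensen's inequality.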

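Putting it together, this yields

$$\|D - \Phi(D, \mathcal{I})\|_1 \le 2 \cdot \frac{\varepsilon}{80} + \frac{\varepsilon}{40}\sum_{I' \in \mathcal{H}} D(I') \le \varepsilon/40 + \varepsilon/40 = \varepsilon/20.$$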
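The decomposition of $\|D - \Phi(D, \mathcal{I})\|_1$ into per-interval terms used at the start of this argument can be checked numerically. The sketch below is only an illustration: the helper names `flatten`, `l1` and `weighted_terms` are introduced here, intervals are 0-indexed and inclusive, and exact rational arithmetic is used so that the two sides can be compared for equality.

```python
from fractions import Fraction

def flatten(D, partition):
    """Phi(D, partition): replace D on each interval by its average there."""
    flat = list(D)
    for lo, hi in partition:
        mass = sum(D[lo:hi + 1])
        for i in range(lo, hi + 1):
            flat[i] = mass / (hi - lo + 1)
    return flat

def l1(p, q):
    """l1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))

def weighted_terms(D, partition):
    """Sum over intervals I_k of D(I_k) * ||D_{I_k} - U_{I_k}||_1."""
    total = Fraction(0)
    for lo, hi in partition:
        mass = sum(D[lo:hi + 1])
        if mass == 0:
            continue  # an interval with no mass contributes nothing
        size = hi - lo + 1
        conditional = [D[i] / mass for i in range(lo, hi + 1)]
        uniform = [Fraction(1, size)] * size
        total += mass * l1(conditional, uniform)
    return total
```

On any explicit distribution and interval partition, `l1(D, flatten(D, partition))` and `weighted_terms(D, partition)` coincide, which is the identity the proof then bounds term by term.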


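Soundness. By contrapositive, we argue that if the test returns ACCEPT, then (with probability
at least 2/3) $D$ is $\varepsilon$-close to $\mathcal{C}$. Indeed, conditioning on $\tilde{D}$ being $\varepsilon/20$-close to $\Phi(D, \mathcal{I})$, we get
by the triangle inequality that

$$\|D - \mathcal{C}\|_1 \le \|D - \Phi(D, \mathcal{I})\|_1 + \|\Phi(D, \mathcal{I}) - \tilde{D}\|_1 + \mathrm{dist}(\tilde{D}, \mathcal{C}) \le \frac{\varepsilon}{20} + \frac{\varepsilon}{20} + \frac{9\varepsilon}{10} = \varepsilon.$$

Overall, this happens except with probability at most 1/10 + 1/10 + 1/10 < 1/3.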

Completeness. Assume $D \in \mathcal{C}$. Then the choice of $\gamma$ and $L$ ensures the existence of a good
dyadic partition $\mathcal{I}(\gamma, D)$ in the sense of Definition 3.1. For any $I$ in this partition for which (i)
holds ($D(I) \le \frac{\gamma}{L} < \frac{\kappa}{2}$), $I$ will have $\frac{m_I}{m} < \kappa$ and be kept as a “light leaf” (this by contrapositive
of Fact 3.4 (ii)). For the other ones, (ii) holds: let $I$ be one of these (at most $L$) intervals.

• If $m_I$ is too small on Step 7, then $I$ is kept as “light leaf.”

• Otherwise, by our choice of constants we can use Lemma 2.8 and apply Lemma 2.9
with $\delta = \frac{1}{10L}$; conditioning on all of the (at most $L$) events happening, which overall
fails with probability at most 1/10 by a union bound, Check-Small-$\ell_2$ will output yes,
as

$$\|D_I - U_I\|_2^2 = \|D_I\|_2^2 - \frac{1}{|I|} \le \left(1 + \frac{\varepsilon^2}{6400}\right)\frac{1}{|I|} - \frac{1}{|I|} = \frac{\varepsilon^2}{6400|I|}$$

and $I$ is kept as “flat leaf.”

Therefore, as $\mathcal{I}(\gamma, D)$ is dyadic the Decomposition stage is guaranteed to stop within at
most $L$ splits (in the worst case, it goes on until $\mathcal{I}(\gamma, D)$ is considered, at which point it
succeeds).⁶ Thus Step 13 passes, and the algorithm reaches the Approximation stage. By
the foregoing discussion, this implies $\Phi(D, \mathcal{I})$ is $\varepsilon/20$-close to $D$ (and hence to $\mathcal{C}$); $\tilde{D}$ is then
(except with probability at most 1/10) $\left(\frac{\varepsilon}{20} + \frac{\varepsilon}{20} = \frac{\varepsilon}{10}\right)$-close to $\mathcal{C}$, and the algorithm returns
ACCEPT.


4 Structural Theorems 

In this section, we show that a wide range of natural distribution families are succinctly decomposable,
and provide efficient projection algorithms for each class.


4.1 Existence of Structural Decompositions 

Theorem 4.1 (Monotonicity). For all $\gamma > 0$, the class $\mathcal{M}$ of monotone distributions on $[n]$ is
$(\gamma, L)$-splittable for $L := O\left(\frac{\log^2 n}{\gamma}\right)$.

Note that this proof can already be found in [BKR04, Theorem 10], interwoven with the analysis 
of their algorithm. For the sake of being self-contained, we reproduce the structural part of their 
argument, removing its algorithmic aspects: 


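Proof of Theorem 4.1. We define the $\mathcal{I}$ recursively as follows: $\mathcal{I}^{(0)} = ([1,n])$, and for $j \ge 0$ the
partition $\mathcal{I}^{(j+1)}$ is obtained from $\mathcal{I}^{(j)} = (I_1^{(j)}, \dots, I_{\ell_j}^{(j)})$ by going over the $I_i^{(j)} = [a_i^{(j)}, b_i^{(j)}]$ in order,
and:

(a) if $D(I_i^{(j)}) \le \frac{\gamma}{L}$, then $I_i^{(j)}$ is added as element of $\mathcal{I}^{(j+1)}$ (“marked as leaf”);

(b) else, if $D(a_i^{(j)}) \le (1 + \gamma) D(b_i^{(j)})$, then $I_i^{(j)}$ is added as element of $\mathcal{I}^{(j+1)}$ (“marked as leaf”);

(c) otherwise, bisect $I_i^{(j)}$ in $I_{\mathrm{L}}^{(j)}$, $I_{\mathrm{R}}^{(j)}$ (with $|I_{\mathrm{L}}^{(j)}| = \lceil |I_i^{(j)}|/2 \rceil$) and add both $I_{\mathrm{L}}^{(j)}$ and $I_{\mathrm{R}}^{(j)}$ as
elements of $\mathcal{I}^{(j+1)}$;
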


⁶ In more detail, we want to argue that if $D$ is in the class, then a decomposition with at most $L$ pieces is found
by the algorithm. Since there is a dyadic decomposition with at most $L$ pieces (namely, $\mathcal{I}(\gamma, D) = (I_1, \dots, I_\ell)$), it
suffices to argue that the algorithm will never split one of the $I_j$'s (as every single $I_j$ will eventually be considered
by the recursive binary splitting, unless the algorithm stopped recursing in this “path” before even considering $I_j$,
which is even better). But this is the case by the above argument, which ensures each such $I_j$ will be recognized as
satisfying one of the two conditions for “good decomposition” (being either close to uniform in $\ell_2$, or having very
little mass).
and repeat until convergence (that is, whenever the last item is not applied for any of the intervals).
Clearly, this process is well-defined, and will eventually terminate (as $(\ell_j)_j$ is a non-decreasing
sequence of natural numbers, upper bounded by $n$). Let $\mathcal{I} = (I_1, \dots, I_\ell)$ (with $I_i = [a_i, a_{i+1})$) be
its outcome, so that the $I_i$'s are consecutive intervals all satisfying either (a) or (b). As (b) clearly
implies (ii), we only need to show that $\ell \le L$; for this purpose, we shall leverage as in [BKR04] the
fact that $D$ is monotone to bound the number of recursion steps.

The recursion above defines a complete binary tree (with the leaves being the intervals satisfying
(a) or (b), and the internal nodes the other ones). Let $t$ be the number of recursion steps the process
goes through before converging to $\mathcal{I}$ (height of the tree); as mentioned above, we have $t \le \log n$ (as
we start with an interval of size $n$, and the length is halved at each step). Observe further that if
at any point an interval $I_i^{(j)} = [a_i^{(j)}, b_i^{(j)}]$ has $D(a_i^{(j)}) \le \frac{\gamma}{nL}$, then it immediately (as well as all the
$I_k^{(j)}$'s for $k \ge i$ by monotonicity) satisfies (a) and is no longer split (“becomes a leaf”). So at any
$j \le t$, the number of intervals $i_j$ for which neither (a) nor (b) holds must satisfy

$$1 \ge D(a_1^{(j)}) > (1+\gamma) D(a_2^{(j)}) > (1+\gamma)^2 D(a_3^{(j)}) > \cdots > (1+\gamma)^{i_j - 1} D(a_{i_j}^{(j)}) \ge (1+\gamma)^{i_j - 1}\frac{\gamma}{nL}$$

where $a_k^{(j)}$ denotes the beginning of the $k$-th interval (again we use monotonicity to argue that the
extrema were reached at the ends of each interval), so that $i_j \le 1 + \frac{\log\frac{nL}{\gamma}}{\log(1+\gamma)}$. In particular, the
total number of internal nodes is then

$$\sum_{j=1}^{t} i_j \le t \cdot \left(1 + \frac{\log\frac{nL}{\gamma}}{\log(1+\gamma)}\right) = (1 + o(1))\frac{\log^2 n}{\log(1+\gamma)} \le L.$$

This implies the same bound on the number of leaves $\ell$. □
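To make the recursion concrete, here is a small sketch that builds the partition for an explicitly given non-increasing distribution. It is an illustration under stated assumptions rather than part of the proof: the function name `monotone_decomposition` is introduced here, indices are 0-based, the parameter `light_mass` plays the role of the threshold in rule (a), and rule (b) is implemented as "first value at most $(1+\gamma)$ times last value", which for a non-increasing $D$ is exactly condition (ii).

```python
from fractions import Fraction

def monotone_decomposition(D, gamma, light_mass):
    """Partition range(len(D)) by repeated bisection, for non-increasing D.

    An interval becomes a leaf when its mass is at most light_mass (rule (a))
    or when its largest value D[lo] is at most (1 + gamma) times its smallest
    value D[hi] (rule (b)); otherwise it is bisected (rule (c))."""
    leaves = []
    stack = [(0, len(D) - 1)]
    while stack:
        lo, hi = stack.pop()
        mass = sum(D[lo:hi + 1])
        if mass <= light_mass or D[lo] <= (1 + gamma) * D[hi]:
            leaves.append((lo, hi))
        else:
            mid = (lo + hi) // 2
            stack.append((mid + 1, hi))
            stack.append((lo, mid))
    return sorted(leaves)
```

A singleton always satisfies rule (b), so the recursion terminates; the content of the theorem is that for monotone $D$ it stops long before reaching singletons everywhere.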

Corollary 4.2 (Unimodality). For all $\gamma > 0$, the class $\mathcal{M}_1$ of unimodal distributions on $[n]$ is
$(\gamma, L)$-decomposable for $L := O\left(\frac{\log^2 n}{\gamma}\right)$.

Proof. For any $D \in \mathcal{M}_1$, $[n]$ can be partitioned in two intervals $I$, $J$ such that $D_I$, $D_J$ are either
monotone non-increasing or non-decreasing. Applying Theorem 4.1 to $D_I$ and $D_J$ and taking the
union of both partitions yields a (no longer necessarily dyadic) partition of $[n]$. □

The same argument yields an analogous statement for $t$-modal distributions:

Corollary 4.3 ($t$-modality). For any $t \ge 1$ and all $\gamma > 0$, the class $\mathcal{M}_t$ of $t$-modal distributions
on $[n]$ is $(\gamma, L)$-decomposable for $L := O\left(\frac{t\log^2 n}{\gamma}\right)$.

Corollary 4.4 (Log-concavity, concavity and convexity). For all $\gamma > 0$, the classes $\mathcal{L}$, $\mathcal{K}^-$ and $\mathcal{K}^+$
of log-concave, concave and convex distributions on $[n]$ are $(\gamma, L)$-decomposable for $L := O\left(\frac{\log^2 n}{\gamma}\right)$.

Proof. This is directly implied by Corollary 4.2, recalling that log-concave, concave and convex 
distributions are unimodal. □ 


Theorem 4.5 (Monotone Hazard Rate). For all $\gamma > 0$, the class $\mathcal{MHR}$ of MHR distributions on
$[n]$ is $(\gamma, L)$-decomposable for $L := O\left(\frac{\log(n/\gamma)}{\gamma}\right)$.

Proof. This follows from adapting the proof of [CDSS13], which establishes that every MHR distribution
can be approximated in $\ell_1$ distance by an $O(\log(n/\varepsilon)/\varepsilon)$-histogram. For completeness, we
reproduce their argument, suitably modified to our purposes, in Appendix B. □

Theorem 4.6 (Piecewise Polynomials). For all $\gamma > 0$, $t, d \ge 0$, the class $\mathcal{P}_{t,d}$ of $t$-piecewise degree-$d$
distributions on $[n]$ is $(\gamma, L)$-decomposable for $L := O\left(\frac{t(d+1)}{\gamma}\log^2 n\right)$. (Moreover, for the class of
$t$-histograms $\mathcal{H}_t$ ($d = 0$) one can take $L = t$.)

Proof. The last part of the statement is obvious, so we focus on the first claim. Observing that
each of the $t$ pieces of a distribution $D \in \mathcal{P}_{t,d}$ can be subdivided in at most $d + 1$ intervals on which
$D$ is monotone (being a degree-$d$ polynomial on each such piece), we obtain a partition of $[n]$ into
at most $t(d + 1)$ intervals. $D$ being monotone on each of them, we can apply an argument almost
identical to that of Theorem 4.1 to argue that each interval can be further split into $O(\log^2 n/\gamma)$
subintervals, yielding a good decomposition with $O(t(d+1)\log^2 n/\gamma)$ pieces. □

4.2 Projection Step: computing the distances 

This section contains details of the distance estimation procedures for these classes, required in
the last stage of Algorithm 1. (Note that some of these results are phrased in terms of distance
approximation, as estimating the distance $\ell_1(D, \mathcal{C})$ to sufficient accuracy in particular yields an
algorithm for this stage.)

We focus in this section on achieving the sample complexities stated in Corollary 1.2, Corollary 1.3,
and Corollary 1.4. While almost all the distance estimation procedures we give in this section are
efficient, running in time polynomial in all the parameters or even with only a polylogarithmic
dependence on $n$, there are two exceptions - namely, the procedures for monotone hazard rate
(Lemma 4.9) and log-concave (Lemma 4.10) distributions. We do describe computationally efficient
procedures for these two cases as well in Section 4.2.1, at a modest additive cost in the
sample complexity.

Lemma 4.7 (Monotonicity [BKR04, Lemma 8]). There exists a procedure ProjectionDist$_{\mathcal{M}}$
that, on input $n$ as well as the full (succinct) specification of an $\ell$-histogram $D$ on $[n]$, computes the
(exact) distance $\ell_1(D, \mathcal{M})$ in time $\mathrm{poly}(\ell)$.

A straightforward modification of the algorithm above (e.g., by adapting the underlying linear
program to take as input the location $m \in [\ell]$ of the mode of the distribution; then trying all $\ell$
possibilities, running the subroutine $\ell$ times and picking the minimum value) results in a similar
claim for unimodal distributions:

Lemma 4.8 (Unimodality). There exists a procedure ProjectionDist$_{\mathcal{M}_1}$ that, on input $n$ as
well as the full (succinct) specification of an $\ell$-histogram $D$ on $[n]$, computes the (exact) distance
$\ell_1(D, \mathcal{M}_1)$ in time $\mathrm{poly}(\ell)$.

A similar result can easily be obtained for the class of $t$-modal distributions as well, with a
$\mathrm{poly}(\ell, t)$-time algorithm based on a combination of dynamic and linear programming. Analogous
statements hold for the classes of concave and convex distributions $\mathcal{K}^+, \mathcal{K}^-$, also based on linear
programming (specifically, on running $O(n^2)$ different linear programs - one for each possible
support $[a, b] \subseteq [n]$ - and taking the minimum over them).
Lemma 4.9 (MHR). There exists a (non-efficient) procedure ProjectionDist$_{\mathcal{MHR}}$ that, on input
$n$, $\varepsilon$, as well as the full specification of a distribution $D$ on $[n]$, distinguishes between $\ell_1(D, \mathcal{MHR}) \le \varepsilon$
and $\ell_1(D, \mathcal{MHR}) \ge 2\varepsilon$ in time $2^{\tilde{O}_\varepsilon(n)}$.

Lemma 4.10 (Log-concavity). There exists a (non-efficient) procedure ProjectionDist$_{\mathcal{L}}$ that,
on input $n$, $\varepsilon$, as well as the full specification of a distribution $D$ on $[n]$, distinguishes between
$\ell_1(D, \mathcal{L}) \le \varepsilon$ and $\ell_1(D, \mathcal{L}) \ge 2\varepsilon$ in time $2^{\tilde{O}_\varepsilon(n)}$.

Proof of Lemma 4.9 and Lemma 4.10. We here give a naive algorithm for these two problems, based on
an exhaustive search over a (huge) $\varepsilon$-cover $\mathcal{S}$ of distributions over $[n]$. Essentially, $\mathcal{S}$ contains
all possible distributions whose probabilities $p_1, \dots, p_n$ are of the form $j\varepsilon/n$, for $j \in \{0, \dots, n/\varepsilon\}$
(so that $|\mathcal{S}| = O((n/\varepsilon)^n)$). It is not hard to see that this indeed defines an $\varepsilon$-cover of the set
of all distributions, and moreover that it can be computed in time $\mathrm{poly}(|\mathcal{S}|)$. To approximate
the distance from an explicit distribution $D$ to the class $\mathcal{C}$ (either $\mathcal{MHR}$ or $\mathcal{L}$), it is enough to
go over every element $S$ of $\mathcal{S}$, checking (this time, efficiently) if $\|S - D\|_1 \le \varepsilon$ and if there is
a distribution $P \in \mathcal{C}$ close to $S$ (this time, pointwise, that is $|P(i) - S(i)| \le \varepsilon/n$ for all $i$) -
which also implies $\|S - P\|_1 \le \varepsilon$ and thus $\|P - D\|_1 \le 2\varepsilon$. The test for pointwise closeness can
be done by checking feasibility of a linear program with variables corresponding to the logarithm
of probabilities, i.e. $x_i = \ln P(i)$. Indeed, this formulation allows one to rephrase the log-concave
and MHR constraints as linear constraints, and pointwise approximation is simply enforcing that
$\ln(S(i) - \varepsilon/n) \le x_i \le \ln(S(i) + \varepsilon/n)$ for all $i$. At the end of this enumeration, the procedure accepts
if and only if for some $S$ both $\|S - D\|_1 \le \varepsilon$ and the corresponding linear program was feasible. □
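The reason the logarithmic change of variables helps can be seen on the log-concavity constraint itself. Under the standard definition assumed here (the support is an interval and $P(i)^2 \ge P(i-1)P(i+1)$ inside it), writing $x_i = \ln P(i)$ turns each constraint into the linear inequality $x_{i-1} + x_{i+1} \le 2x_i$. The sketch below checks the multiplicative form directly on an explicit vector; the function name `is_log_concave` is introduced for this illustration, and it is a membership check only, not the linear program of the proof.

```python
def is_log_concave(P):
    """Check log-concavity of an explicit nonnegative vector P: its support
    must be a contiguous interval and P[i]**2 >= P[i-1] * P[i+1] must hold
    at every interior point of the support."""
    support = [i for i, p in enumerate(P) if p > 0]
    if not support:
        return False
    lo, hi = support[0], support[-1]
    if len(support) != hi - lo + 1:
        return False  # a gap in the support rules out log-concavity
    return all(P[i] ** 2 >= P[i - 1] * P[i + 1] for i in range(lo + 1, hi))
```

For example, the Binomial$(4, 1/2)$ probabilities pass the check, while a vector with a dip in the middle fails it.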

Lemma 4.11 (Piecewise Polynomials). There exists a procedure ProjectionDist$_{\mathcal{P}_{t,d}}$ that, on
input $n$ as well as the full specification of an $\ell$-histogram $D$ on $[n]$, computes an approximation
$\Delta$ of the distance $\ell_1(D, \mathcal{P}_{t,d})$ such that $\ell_1(D, \mathcal{P}_{t,d}) \le \Delta \le 3\ell_1(D, \mathcal{P}_{t,d}) + \varepsilon$, and runs in time
$O(n^3) \cdot \mathrm{poly}(\ell, t, d, \frac{1}{\varepsilon})$.

Moreover, for the special case of $t$-histograms ($d = 0$) there exists a procedure ProjectionDist$_{\mathcal{H}_t}$,
which, given inputs as above, computes an approximation $\Delta$ of the distance $\ell_1(D, \mathcal{H}_t)$ such that
$\ell_1(D, \mathcal{H}_t) \le \Delta \le 4\ell_1(D, \mathcal{H}_t) + \varepsilon$, and runs in time $\mathrm{poly}(\ell, t, \frac{1}{\varepsilon})$, independent of $n$.

Proof. We begin with ProjectionDist$_{\mathcal{H}_t}$. Fix any distribution $D$ on $[n]$. Given any explicit
partition of $[n]$ into intervals $\mathcal{I} = (I_1, \dots, I_t)$, one can easily show that $\|D - \Phi(D, \mathcal{I})\|_1 \le 2\,\mathrm{opt}_{\mathcal{I}}$,
where $\mathrm{opt}_{\mathcal{I}}$ is the optimal distance of $D$ to any histogram on $\mathcal{I}$. To get a 2-approximation of
$\ell_1(D, \mathcal{H}_t)$, it thus suffices to find the minimum, over all possible partitionings $\mathcal{I}$ of $[n]$ into $t$ intervals,
of the quantity $\|D - \Phi(D, \mathcal{I})\|_1$ (which itself can be computed in time $T = O(\min(t\ell, n))$). By a
simple dynamic programming approach, this can be performed in time $O(tn^2 \cdot T)$. The quadratic
dependence on $n$, which follows from allowing the endpoints of the $t$ intervals to be at any point of
the domain, is however far from optimal and can be reduced to $(t/\varepsilon)^2$, as we show below.
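The "simple dynamic programming approach" can be sketched as follows. This is a minimal, unoptimized illustration working on the explicit probability vector rather than on the succinct histogram description, so its running time is worse than the $O(tn^2 \cdot T)$ stated above; the names `flattening_cost` and `best_flattening_error` are introduced here, and the minimum is taken over partitions into at most $t$ intervals.

```python
def flattening_cost(D, i, j):
    """l1 distance between D and its average on the half-open block [i, j)."""
    avg = sum(D[i:j]) / (j - i)
    return sum(abs(p - avg) for p in D[i:j])

def best_flattening_error(D, t):
    """Minimum of ||D - Phi(D, I)||_1 over partitions I of range(len(D))
    into at most t consecutive intervals."""
    n = len(D)
    # best[k][j]: least cost of splitting the prefix [0, j) into exactly k blocks.
    best = [[None] * (n + 1) for _ in range(t + 1)]
    best[0][0] = 0
    for k in range(1, t + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                if best[k - 1][i] is None:
                    continue
                cand = best[k - 1][i] + flattening_cost(D, i, j)
                if best[k][j] is None or cand < best[k][j]:
                    best[k][j] = cand
    return min(best[k][n] for k in range(1, t + 1) if best[k][n] is not None)
```

Restricting the candidate endpoints `i` and `j` to a small precomputed set is what removes the dependence on $n$ in the refined procedure.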

For $\eta > 0$, define an $\eta$-granular decomposition of a distribution $D$ over $[n]$ to be a partition of
$[n]$ into $s = O(1/\eta)$ intervals $J_1, \dots, J_s$ such that each interval $J_j$ is either a singleton or satisfies
$D(J_j) \le \eta$. (Note that if $D$ is a known $\ell$-histogram, one can compute an $\eta$-granular decomposition
of $D$ in time $O(\ell/\eta)$ in a greedy fashion.)
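One possible greedy construction is sketched below. It is an assumption-laden illustration: it scans the explicit probability vector point by point (so it runs in time linear in $n$ rather than the $O(\ell/\eta)$ achievable from a succinct histogram description), assumes a non-empty input, and the function name `granular_decomposition` is introduced here.

```python
def granular_decomposition(D, eta):
    """Greedily partition range(len(D)) into consecutive intervals, each of
    which is either a singleton or has total mass at most eta.
    Intervals are returned as 0-indexed inclusive pairs (lo, hi)."""
    pieces = []
    start, mass = 0, 0
    for i, p in enumerate(D):
        # Close the current interval if adding point i would exceed eta.
        if i > start and mass + p > eta:
            pieces.append((start, i - 1))
            start, mass = i, 0
        mass += p
    pieces.append((start, len(D) - 1))
    return pieces
```

Any two consecutive intervals produced this way have combined mass greater than $\eta$, which is why the number of pieces is $O(1/\eta)$.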

Claim 4.12. Let $D$ be a distribution over $[n]$, and $\mathcal{J} = (J_1, \dots, J_s)$ be an $\eta$-granular decomposition
of $D$ (with $s \ge t$). Then, there exists a partition of $[n]$ into $t$ intervals $\mathcal{I} = (I_1, \dots, I_t)$ and a
$t$-histogram $H$ on $\mathcal{I}$ such that $\|D - H\|_1 \le 2\ell_1(D, \mathcal{H}_t) + 2t\eta$, and $\mathcal{I}$ is a coarsening of $\mathcal{J}$.
Before proving it, we describe how this will enable us to get the desired time complexity for
ProjectionDist$_{\mathcal{H}_t}$. Phrased differently, the claim above allows us to run our dynamic program
using the $O(1/\eta)$ endpoints of the $O(1/\eta)$ intervals of the decomposition instead of the $n$ points of the domain, paying only an
additive error $O(t\eta)$. Setting $\eta = \frac{\varepsilon}{4t}$, the guarantee for ProjectionDist$_{\mathcal{H}_t}$ follows.

Proof of Claim 4.12. Let $\mathcal{J} = (J_1, \dots, J_s)$ be an $\eta$-granular decomposition of $D$, and $H^* \in \mathcal{H}_t$ be
a histogram achieving $\mathrm{opt} = \ell_1(D, \mathcal{H}_t)$. Denote further by $\mathcal{I}^* = (I_1^*, \dots, I_t^*)$ the partition of $[n]$
corresponding to $H^*$. Consider now the $r \le t$ endpoints of the $I_i^*$'s that do not fall on one of the
endpoints of the $J_j$'s: let $J_{i_1}, \dots, J_{i_r}$ be the respective intervals in which they fall (in particular,
these cannot be singleton intervals), and $S = \bigcup_{j=1}^{r} J_{i_j}$ their union. By definition of $\eta$-granularity,
$D(S) \le t\eta$, and it follows that $H^*(S) \le t\eta + \frac{1}{2}\mathrm{opt}$. We define $H$ from $H^*$ in two stages: first, we
obtain a (sub)distribution $H'$ by modifying $H^*$ on $S$, setting for each $x \in J_{i_j}$ the value of $H'$ to be
the minimum value (among the two options) that $H^*$ takes on $J_{i_j}$. $H'$ is thus a $t$-histogram, and
the endpoints of its intervals are endpoints of $\mathcal{J}$ as wished; but it may not sum to one. However,
by construction we have that $H'([n]) \ge 1 - H^*(S) \ge 1 - t\eta - \frac{1}{2}\mathrm{opt}$. Using this, we can finally
define our $t$-histogram distribution $H$ as the renormalization of $H'$. It is easy to check that $H$ is a
valid $t$-histogram on a coarsening of $\mathcal{J}$, and

$$\|D - H\|_1 \le \|D - H'\|_1 + (1 - H'([n])) \le \|D - H^*\|_1 + \|H^* - H'\|_1 + t\eta + \frac{1}{2}\mathrm{opt} \le 2\,\mathrm{opt} + 2t\eta$$

as stated. □

Turning now to ProjectionDist$_{\mathcal{P}_{t,d}}$, we apply the same initial dynamic programming approach,
which will result in a running time of $O(n^2 t \cdot T)$, where $T$ is the time required to estimate
(to sufficient accuracy) the distance of a given (sub)distribution over an interval $I$ onto the space $\mathcal{P}_d$
of degree-$d$ polynomials. Specifically, we will invoke the following result, adapted from [CDSS14a]
to our setting:

Theorem 4.13. Let $p$ be an $\ell$-histogram over $[-1,1)$. There is an algorithm ProjectSinglePoly$(d, \eta)$
which runs in time $\mathrm{poly}(\ell, d+1, 1/\eta)$, and outputs a degree-$d$ polynomial $q$ which defines a pdf over
$[-1,1)$ such that $\|p - q\|_1 \le 3\ell_1(p, \mathcal{P}_d) + O(\eta)$.

The proof of this modification of [CDSS14a, Theorem 9] is deferred to Appendix C. Applying
it as a blackbox with $\eta$ set to $O(\varepsilon/t)$ and noting that computing the $\ell_1$ distance to our explicit
distribution on a given interval of the degree-$d$ polynomial returned incurs an additional $O(n)$
factor, we obtain the claimed guarantee and running time. □

4.2.1 Computationally Efficient Procedures for Log-concave and MHR Distributions 

We now describe how to obtain efficient testing for the classes $\mathcal{L}$ and $\mathcal{MHR}$ - that is, how to obtain
polynomial-time distance estimation procedures for these two classes, unlike the ones described in
the previous section. At a very high level, the idea is in both cases to write down a linear program on
variables related logarithmically to the probabilities we are searching for, as enforcing the log-concave
and MHR constraints on these new variables can be done linearly. The catch now becomes the $\ell_1$
objective function (and, to a lesser extent, the fact that the probabilities must sum to one), now
highly non-linear.

The first insight is to leverage the structure of log-concave (resp. monotone hazard rate) distributions
to express this objective as slightly stronger constraints, specifically pointwise $(1 \pm \varepsilon)$-multiplicative
closeness, much easier to enforce in our “logarithmic formulation.” Even so, doing
this naively fails, essentially because of a too weak distance guarantee between our explicit histogram
$\tilde{D}$ and the unknown distribution we are trying to find: in the completeness case, we are
only promised $\varepsilon$-closeness in $\ell_1$, while we would also require good additive pointwise closeness of
the order $\varepsilon^2$ or $\varepsilon^3$.

The second insight is thus to observe that we “almost” have this for free: indeed, if we do
not reject in the first stage of the testing algorithm, we do obtain an explicit $k$-histogram $\tilde{D}$ with
the guarantee that $\tilde{D}$ is $\varepsilon$-close to the distribution $P$ to test. However, we also implicitly have
another distribution $D'$ that is $\sqrt{\varepsilon/k}$-close to $P$ in Kolmogorov distance: as in the recursive descent
we take enough samples to use the DKW inequality (Theorem 2.10) with this parameter, i.e. an
additive overhead of $O(k/\varepsilon)$ samples (on top of the $\tilde{O}(\sqrt{kn}/\varepsilon^{7/2})$). If we are willing to increase this
overhead by just a small amount, that is to take $O(\max(k/\varepsilon, 1/\varepsilon^4))$, we can guarantee that $D'$ be
also $O(\varepsilon^2)$-close to $P$ in Kolmogorov distance.

Combining these ideas yields the following distance estimation lemmas:

Lemma 4.14 (Monotone Hazard Rate). There exists a procedure ProjectionDist$^{*}_{\mathcal{MHR}}$ that, on
input $n$ as well as the full specification of a $k$-histogram distribution $D$ on $[n]$ and of an $\ell$-histogram
distribution $D'$ on $[n]$, runs in time $\mathrm{poly}(n, 1/\varepsilon)$, and satisfies the following.

• If there is $P \in \mathcal{MHR}$ such that $\|D - P\|_1 \le \varepsilon$ and $\|D' - P\|_{\mathrm{Kol}} \le \varepsilon^3$, then the procedure
returns yes;

• If $\ell_1(D, \mathcal{MHR}) \ge 100\varepsilon$, then the procedure returns no.

Lemma 4.15 (Log-concavity). There exists a procedure ProjectionDist$^{*}_{\mathcal{L}}$ that, on input $n$ as
well as the full specifications of a $k$-histogram distribution $D$ on $[n]$ and an $\ell$-histogram distribution
$D'$ on $[n]$, runs in time $\mathrm{poly}(n, k, \ell, 1/\varepsilon)$, and satisfies the following.

2 

• If there is P € C such that \\D — P^ < e and \\D' ~ P 11 Kol < . then the procedure 

returns yes; 

• If $\ell_1(D, \mathcal{L}) \ge 100\varepsilon$, then the procedure returns no.

The proofs of these two lemmas are quite technical and deferred to Appendix C. With these in 
hand, a simple modification of our main algorithm (specifically, setting rn = 0(max(y / ]7[/e 3 L, P 2 /e 2 , l/ff c )) 
for c either 4 or 6 instead of 0(max( v /]T[/e 3 P, P 2 /e 2 )), to get the desired Kolmogorov distance 
guarantee; and providing the empirical histogram defined by these m samples along to the distance 
estimation procedure) suffices to obtain the following counterpart to Corollary 1.2: 

Corollary 4.16. The algorithm TestSplittable, after this modification, can efficiently test the
classes of log-concave and monotone hazard rate (MHR) distributions, with respectively $\tilde{O}(\sqrt{n}/\varepsilon^{7/2} + 1/\varepsilon^4)$
and $\tilde{O}(\sqrt{n}/\varepsilon^{7/2} + 1/\varepsilon^6)$ samples.

5 Going Further: Reducing the Support Size 

The general approach we have been following so far gives, out-of-the-box, an efficient testing algorithm
with sample complexity $\tilde{O}(\sqrt{n})$ for a large range of properties. However, this sample
complexity can for some classes $\mathcal{P}$ be brought down a lot more, by taking advantage in a preprocessing
step of good concentration guarantees of distributions in $\mathcal{P}$.

As a motivating example, consider the class of Poisson Binomial Distributions (PBD). It is well-known
(see e.g. [KG71, Section 2]) that PBDs are unimodal, and more specifically that $\mathcal{PBD}_n \subseteq \mathcal{L} \subseteq \mathcal{M}_1$.
Therefore, using our generic framework we can test Poisson Binomial Distributions with
$\tilde{O}(\sqrt{n})$ samples. This is, however, far from optimal: as shown in [AD15], a sample complexity
of $\Theta(n^{1/4})$ is both necessary and sufficient. The reason our general algorithm ends up making
quadratically too many queries can be explained as follows. PBDs are tightly concentrated around
their expectation, so that they “morally” live on a support of size $m = O(\sqrt{n})$. Yet, instead of
testing them on this very small support, in the above we still consider the entire range $[n]$, and
thus end up paying a dependence $\sqrt{n}$ instead of $\sqrt{m}$.

If we could use that observation to first reduce the domain to the effective support of the
distribution, then we could call our testing algorithm on this reduced domain of size $O(\sqrt{n})$. In the
rest of this section, we formalize and develop this idea, and in Section 5.2 will obtain as a direct
application an $\tilde{O}(n^{1/4})$-query testing algorithm for $\mathcal{PBD}_n$.

Definition 5.1. Given $\varepsilon > 0$, the $\varepsilon$-effective support of a distribution $D$ is the smallest interval $I$
such that $D(I) \ge 1 - \varepsilon$.

The last definition we shall require is of the conditioned distributions of a class $\mathcal{C}$:

Definition 5.2. For any class of distributions $\mathcal{C}$ over $[n]$, define the set of conditioned distributions
of $\mathcal{C}$ (with respect to $\varepsilon > 0$ and interval $I \subseteq [n]$) as $\mathcal{C}^{\varepsilon,I} := \{ D_I : D \in \mathcal{C}, D(I) \ge 1 - \varepsilon \}$.

Finally, we will require the following simple result:


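Lemma 5.3. Let $D$ be a distribution over $[n]$, and $I \subseteq [n]$ an interval such that $D(I) \ge 1 - \frac{\varepsilon}{10}$.
Then,

• If $D \in \mathcal{C}$, then $D_I \in \mathcal{C}^{\frac{\varepsilon}{10}, I}$;

• If $\ell_1(D, \mathcal{C}) > \varepsilon$, then $\ell_1(D_I, \mathcal{C}^{\frac{\varepsilon}{10}, I}) > \frac{7\varepsilon}{10}$.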


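Proof. The first item is obvious. As for the second, let $P \in \mathcal{C}$ be any distribution with $P(I) \ge 1 - \frac{\varepsilon}{10}$.
By assumption, $\|D - P\|_1 > \varepsilon$: but we have, writing $\alpha = 1/10$,

$$\|D_I - P_I\|_1 = \sum_{i \in I}\left|\frac{D(i)}{D(I)} - \frac{P(i)}{P(I)}\right| = \frac{1}{D(I)}\sum_{i \in I}\left|D(i) - P(i) + P(i)\left(1 - \frac{D(I)}{P(I)}\right)\right|$$

$$\ge \frac{1}{D(I)}\left(\sum_{i \in I}|D(i) - P(i)| - \left|1 - \frac{D(I)}{P(I)}\right|\sum_{i \in I} P(i)\right) = \frac{1}{D(I)}\left(\sum_{i \in I}|D(i) - P(i)| - |P(I) - D(I)|\right) \ge \frac{1}{D(I)}\left(\sum_{i \in I}|D(i) - P(i)| - \alpha\varepsilon\right).$$

Since $\sum_{i \notin I}|D(i) - P(i)| \le D([n] \setminus I) + P([n] \setminus I) \le 2\alpha\varepsilon$, the last sum is at least $\|D - P\|_1 - 2\alpha\varepsilon$; as $D(I) \le 1$, this gives $\|D_I - P_I\|_1 > (1 - 3\alpha)\varepsilon = \frac{7\varepsilon}{10}$. □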
We now proceed to state and prove our result - namely, efficient testing of structured classes of
distributions with nice concentration properties.

Theorem 5.4. Let $\mathcal{C}$ be a class of distributions over $[n]$ for which the following holds.

1. there is a function $M(\cdot, \cdot)$ such that each $D \in \mathcal{C}$ has $\varepsilon$-effective support of size at most $M(n, \varepsilon)$;

2. for every $\varepsilon \in [0,1]$ and interval $I \subseteq [n]$, $\mathcal{C}^{\varepsilon,I}$ is $(\gamma, L)$-splittable;

3. there exists an efficient procedure ProjectionDist$_{\mathcal{C}^{\varepsilon,I}}$ which, given as input the explicit description
of a distribution $D$ over $[n]$ and interval $I \subseteq [n]$, computes the distance $\ell_1(D_I, \mathcal{C}^{\varepsilon,I})$.

Then, the algorithm TestEffectiveSplittable (Algorithm 2) is a O^max^Vuiilogm, - 
sample tester for C, where m = M(n, + ) and l = 


Algorithm 2 TestEffectiveSplittable

Require: Domain $\Omega$ (interval of size $n$), sample access to $D$ over $\Omega$; subroutine ProjectionDist$_{\mathcal{C}^{\varepsilon,I}}$
Input: Parameters $\varepsilon \in (0,1]$, function $L(\cdot, \cdot)$, and upper bound function $M(\cdot, \cdot)$ for the effective
support of the class $\mathcal{C}$.

1 : Set m = f 0( 1/e 2 ), r c = M(n, ^j). 

2: Effective Support 

3: Compute $\hat{D}$, an empirical estimate of $D$, by drawing $m$ independent samples from $D$.

4: Let J be the largest interval of the form { 1 ,..., j} such that D(J) < 

5: Let K be the largest interval of the form {k,... , n} such that D(K) < 

6: Set $I \leftarrow [n] \setminus (J \cup K)$.

7: if $|I| > r$ then return REJECT

8: end if

9: 

10: Testing 

11: Call TestSplittable with / (providing simulated access to Dj by rejection sampling, 

returning FAIL if the number of samples q from Dj required by the subroutine is not obtained 
after 0(q) samples from D), ProjectionDist C£ ,j, parameters s' = f and L(-,-). 

12: return ACCEPT if TestSplittable accepts, REJECT otherwise. 

13: 
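The simulated access to $D_I$ in Step 11 is plain rejection sampling: draw from $D$ and keep only the draws that land in $I$. The sketch below illustrates this with a hard cap on the number of draws; the function name `conditional_samples`, the callable `draw` standing for sample access to $D$, and the explicit `budget` parameter are all assumptions of this illustration.

```python
def conditional_samples(draw, lo, hi, q, budget):
    """Try to collect q samples from D conditioned on the interval [lo, hi],
    using at most `budget` calls to draw() (each call returns one sample
    from D). Returns the list of q samples, or None to signal FAIL."""
    kept = []
    for _ in range(budget):
        x = draw()
        if lo <= x <= hi:
            kept.append(x)
            if len(kept) == q:
                return kept
    return None
```

When $D(I)$ is bounded below by a constant, a budget proportional to $q$ suffices with high probability, which is the role of the $O(q)$ cap in Step 11.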


5.1 Proof of Theorem 5.4 

By the choice of m and the DKW inequality, with probability at least 23/24 the estimate D satisfies 
||Z? — D || Ko1 < Conditioning on that from now on, we get that D(I) > D(I) — > 1 — 

Furthermore, denoting by j and k the two inner endpoints of J and K in Steps 4 and 5, we have 
D(J U {j + 1}) > D(J U {j + 1}) — (fj > (fj (similarly for D(K U {k — 1})), so that I has size at 
most o + l, where o is the ^-effective support size of D. 

Finally, note that since $D(I) = \Omega(1)$ by our conditioning, the simulation of samples by rejection
sampling will succeed with probability at least 23/24 and the algorithm will not output FAIL.

Sample complexity. The sample complexity is the sum of the 0( 1/e 2 ) in Step 3 and the 0(q ) in 
Step 11. From Theorem 1.1 and the choice of /, this latter quantity is O f max (71 v 7 M £ log M, ^j)J 
where M = M(n , and l = L(j^,M(n, ^)).


Correctness. If D £ C, then by the setting of r (set to be an upper bound on the ^-effective sup¬ 
port size of any distribution in C) the algorithm will go beyond Step 6 . The call to TestSplittable 
will then end up in the algorithm returning ACCEPT in Step 12, with probability at least 2/3 
by Lemma 5.3, Theorem 1.1 and our choice of parameters. 

Similarly, if D is e-far from C, then either its effective support is too large (and then the test 
on Step 6 fails), or the main tester will detect that its conditional distribution on / is ||-far from 
C and output REJECT in Step 12. 

Overall, in either case the algorithm is correct except with probability at most 1/24 + 1/24 + 
1/3 = 5/12 (by a union bound). Repeating constantly many times and outputting the majority 
vote brings the probability of failure down to 1/3. □ 


5.2 Application: Testing Poisson Binomial Distributions 


In this section, we illustrate the use of our generic two-stage approach to test the class of Poisson 
Binomial Distributions. Specifically, we prove the following result: 


Corollary 5.5. The class of Poisson Binomial Distributions can be tested with $\tilde{O}\left(n^{1/4}/\varepsilon^{7/2}\right) + \tilde{O}\left(\log^4 n/\varepsilon^4\right)$
samples, using Algorithm 2.


This is a direct consequence of Theorem 5.4 and the lemmas below. The first one states that, 
indeed, PBDs have small effective support: 


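Fact 5.6. For any $\varepsilon > 0$, a PBD has $\varepsilon$-effective support of size $O\left(\sqrt{n\log(1/\varepsilon)}\right)$.

Proof. By an additive Chernoff Bound, any random variable $X$ following a Poisson Binomial Distribution
has $\Pr[|X - \mathbb{E}X| > \gamma n] \le 2e^{-2\gamma^2 n}$. Taking $\gamma := \sqrt{\frac{1}{2n}\ln\frac{2}{\varepsilon}}$, we get that $\Pr[X \in I] \ge 1 - \varepsilon$,
where $I := \left[\mathbb{E}X - \sqrt{\frac{n}{2}\ln\frac{2}{\varepsilon}},\ \mathbb{E}X + \sqrt{\frac{n}{2}\ln\frac{2}{\varepsilon}}\right]$. □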
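As a quick sanity check of the interval in this proof, the sketch below computes it for given $n$, mean and $\varepsilon$, clipped to $\{0, \dots, n\}$. The function name `hoeffding_interval` is introduced for this illustration; the formula is the one from the proof, $\mathbb{E}X \pm \sqrt{\frac{n}{2}\ln\frac{2}{\varepsilon}}$.

```python
import math

def hoeffding_interval(n, mean, eps):
    """Integer interval [lo, hi] within {0, ..., n} around `mean` that, by the
    additive Chernoff bound, contains a sum of n independent Bernoulli
    variables with probability at least 1 - eps."""
    half_width = math.sqrt(n / 2 * math.log(2 / eps))
    lo = max(0, math.ceil(mean - half_width))
    hi = min(n, math.floor(mean + half_width))
    return lo, hi
```

For a Binomial$(100, 1/2)$ variable (a special case of a PBD) and $\varepsilon = 0.1$, the interval has about 25 points out of 101 and carries well over $1 - \varepsilon$ of the mass.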


It is clear that if $D \in \mathcal{PBD}_n$ (and therefore is unimodal), then for any interval $I \subseteq [n]$ the
conditional distribution $D_I$ is still unimodal, and thus the class of conditioned PBDs $\mathcal{PBD}_n^{\varepsilon,I} := \{ D_I : D \in \mathcal{PBD}_n, D(I) \ge 1 - \varepsilon \}$
falls under Corollary 4.2. The last piece we need to apply our
generic testing framework is the existence of an algorithm to compute the distance between an
(explicit) distribution and the class of conditioned PBDs. This is provided by our next lemma:


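Claim 5.7. There exists a procedure ProjectionDist$_{\mathcal{PBD}_n^{\varepsilon,I}}$ that, on input $n$ and $\varepsilon \in [0,1]$,
$I \subseteq [n]$ as well as the full specification of a distribution $D$ on $[n]$, computes a value $\tau$ such that
$\tau \in [1 \pm 2\varepsilon] \cdot \ell_1(D, \mathcal{PBD}_n^{\varepsilon,I}) \pm \frac{\varepsilon}{100}$, in time $n^2 (1/\varepsilon)^{O(\log 1/\varepsilon)}$.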

Proof. The goal is to find a $\gamma = \Theta(\varepsilon)$-approximation of the minimum value of
\[ \sum_{i \in I} \left| \frac{P(i)}{P(I)} - \frac{D(i)}{D(I)} \right| \]
subject to $P(I) = \sum_{i \in I} P(i) \ge 1 - \varepsilon$ and $P \in \mathcal{PBD}_n$. We first note that, given the parameters
$n \in \mathbb{N}$ and $p_1, \ldots, p_n \in [0,1]$ of a PBD $P$, the vector of $(n+1)$ probabilities $P(0), \ldots, P(n)$ can be
obtained in time $O(n^2)$ by dynamic programming. Therefore, computing the $\ell_1$ distance between
$D$ and any PBD with known parameters can be done efficiently. To conclude, we invoke a result
of Diakonikolas, Kane, and Stewart, that guarantees the existence of a succinct (proper) cover of
$\mathcal{PBD}_n$.









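Theorem 5.8 ([DKS15, Theorem 14] (rephrased)). For all $n$, $\gamma > 0$, there exists a set $\mathcal{S}_\gamma \subseteq \mathcal{PBD}_n$
such that:

(i) $\mathcal{S}_\gamma$ is a $\gamma$-cover of $\mathcal{PBD}_n$; that is, for all $D \in \mathcal{PBD}_n$ there exists some $D' \in \mathcal{S}_\gamma$ such that
$\|D - D'\|_1 \le \gamma$

(ii) $|\mathcal{S}_\gamma| \le n (1/\gamma)^{O(\log 1/\gamma)}$

(iii) $\mathcal{S}_\gamma$ can be computed in time $n (1/\gamma)^{O(\log 1/\gamma)}$
and each $D \in \mathcal{S}_\gamma$ is explicitly described by its set of parameters.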

We further observe that the factor n in both the size of the cover and running time can be easily 
removed in our case, as we know a good approximation of the support size of the candidate PBDs. 
(That is, we only need to enumerate over a subset of the cover of [DKS15], that of the PBDs with 
effective support compatible with our distribution D.) 

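Set $\gamma = \Theta(\varepsilon)$ as above. Fix $P \in \mathcal{PBD}_n$ such that $P(I) \ge 1 - \varepsilon$, and $Q \in \mathcal{S}_\gamma$ such that $\|P - Q\|_1 \le \gamma$.
In particular, it is easy to see via the correspondence between $\ell_1$ and total variation distance that
$|P(I) - Q(I)| \le \gamma/2$. By a calculation analogous to that of Lemma 5.3, we have

\[
\begin{aligned}
\|P_I - Q_I\|_1 &= \sum_{i \in I} \left| \frac{P(i)}{P(I)} - \frac{Q(i)}{Q(I)} \right|
 = \frac{1}{P(I)} \sum_{i \in I} \left| P(i) - Q(i) + Q(i)\left(1 - \frac{P(I)}{Q(I)}\right) \right| \\
&= \frac{1}{P(I)} \left( \sum_{i \in I} |P(i) - Q(i)| \pm \left|1 - \frac{P(I)}{Q(I)}\right| \sum_{i \in I} Q(i) \right) \\
&= \frac{1}{P(I)} \left( \sum_{i \in I} |P(i) - Q(i)| \pm |P(I) - Q(I)| \right) \\
&= \frac{1}{P(I)} \left( \sum_{i \in I} |P(i) - Q(i)| \pm \frac{\gamma}{2} \right) \\
&= \frac{1}{P(I)} \left( \|P - Q\|_1 \pm \frac{5\gamma}{2} \right) \\
&\in \left[ \|P - Q\|_1 - 5\gamma/2,\ (1 + 2\varepsilon)\left(\|P - Q\|_1 + 5\gamma/2\right) \right]
\end{aligned}
\]

where we used the fact that $\sum_{i \notin I} |P(i) - Q(i)| = 2\left(\sum_{i \notin I:\, P(i) > Q(i)} (P(i) - Q(i))\right) + Q(I^c) - P(I^c) \in [-2\gamma, 2\gamma]$. By the triangle inequality, this implies that the minimum of $\|P_I - D_I\|_1$ over the distributions $P$ of $\mathcal{S}_\gamma$ with $P(I) \ge 1 - (\varepsilon + \gamma/2)$ will be within an additive $O(\varepsilon)$ of $\ell_1(D, \mathcal{PBD}_n^{\varepsilon,I})$. The
fact that this minimum can be computed in time $\mathrm{poly}(n) \cdot (1/\varepsilon)^{O(\log 1/\varepsilon)}$ concludes the proof. □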

As previously mentioned, this approximation guarantee for $\ell_1(D, \mathcal{PBD}_n^{\varepsilon,I})$ is sufficient for the
purpose of Algorithm 1.

Proof of Corollary 5.5. Combining the above, we invoke Theorem 5.4 with $M(n, \varepsilon) = O(\sqrt{n \log(1/\varepsilon)})$
(Fact 5.6) and $P(m, \gamma) = O(\log m)$ (Corollary 4.2). This yields the claimed sample complexity; finally,
the efficiency is a direct consequence of Claim 5.7. □


6 Lower Bounds 

6.1 Reduction-based Lower Bound Approach 

We now turn to proving converses to our positive results - namely, that many of the upper bounds 
we obtain cannot be significantly improved upon. As in our algorithmic approach, we describe for 
this purpose a generic framework for obtaining lower bounds. 
In order to state our results, we will require the usual definition of agnostic learning. Recall
that an algorithm is said to be a semi-agnostic learner for a class $\mathcal{C}$ if it satisfies the following.
Given sample access to an arbitrary distribution $D$ and parameter $\varepsilon$, it outputs a hypothesis $\hat{D}$
which (with high probability) does “almost as well as it gets”:

\[ \|D - \hat{D}\|_1 \le c \cdot \mathrm{OPT}_{\mathcal{C},D} + O(\varepsilon) \]

where $\mathrm{OPT}_{\mathcal{C},D} \stackrel{\rm def}{=} \inf_{D' \in \mathcal{C}} \ell_1(D', D)$, and $c \ge 1$ is some absolute constant (if $c = 1$, the learner is
said to be agnostic).

High-level idea. The motivation for our result is the observation of [BKR04] that “monotonicity
is at least as hard as uniformity.” Unfortunately, their specific argument does not readily generalize
to other classes of distributions. The starting point of our
approach is to observe that while uniformity testing is hard in general, it becomes very easy under
the promise that the distribution is monotone, or even only close to monotone (namely, $O(1/\varepsilon^2)$
samples suffice). This can give an alternate proof of the lower bound for monotonicity testing, via
a different reduction: first, test if the unknown distribution is monotone; if it is, test whether it is
uniform, now assuming closeness to monotone.

More generally, this idea applies to any class $\mathcal{C}$ which (a) contains the uniform distribution, and
(b) for which we have an $o(\sqrt{n})$-sample agnostic learner $\mathcal{L}$, as follows. Assuming we have a tester $\mathcal{T}$
for $\mathcal{C}$ with sample complexity $o(\sqrt{n})$, define a uniformity tester as below.

• test if $D \in \mathcal{C}$ using $\mathcal{T}$; if not, reject (as $\mathcal{U} \in \mathcal{C}$, $D$ cannot be uniform);

• otherwise, agnostically learn $D$ with $\mathcal{L}$ (since $D$ is close to $\mathcal{C}$), and obtain hypothesis $\hat{D}$;

• check offline if $\hat{D}$ is close to uniform.

By assumption, $\mathcal{T}$ and $\mathcal{L}$ each use $o(\sqrt{n})$ samples, and so does the whole process; but this contradicts
the lower bound of [BFR+00, Pan08] on uniformity testing. Hence, $\mathcal{T}$ must use $\Omega(\sqrt{n})$ samples.

This “testing-by-narrowing” reduction argument can be further extended to properties
other than uniformity, as we show below:

Theorem 6.1. Let $\mathcal{C}$ be a class of distributions over $[n]$ for which the following holds:

(i) there exists a semi-agnostic learner $\mathcal{L}$ for $\mathcal{C}$, with sample complexity $q_L(n, \varepsilon, \delta)$ and “agnostic
constant” $c$;

(ii) there exists a subclass $\mathcal{C}_{\rm Hard} \subseteq \mathcal{C}$ such that testing $\mathcal{C}_{\rm Hard}$ requires $q_H(n, \varepsilon)$ samples.

Suppose further that $q_L(n, \varepsilon, 1/10) = o(q_H(n, \varepsilon))$. Then, any tester for $\mathcal{C}$ must use $\Omega(q_H(n, \varepsilon))$
samples.

Proof. The above theorem relies on the reduction outlined above, which we rigorously detail here.
Assuming $\mathcal{C}$, $\mathcal{C}_{\rm Hard}$, $\mathcal{L}$ as above (with semi-agnostic constant $c \ge 1$), and a tester $\mathcal{T}$ for $\mathcal{C}$ with
sample complexity $q_T(n, \varepsilon)$, we define a tester $\mathcal{T}_{\rm Hard}$ for $\mathcal{C}_{\rm Hard}$. On input $\varepsilon \in (0,1]$ and given sample
access to a distribution $D$ on $[n]$, $\mathcal{T}_{\rm Hard}$ acts as follows:

• call $\mathcal{T}$ with parameters $n$, $\varepsilon'/c$ (where $\varepsilon' \stackrel{\rm def}{=} \varepsilon/3$) and failure probability 1/6, to $(\varepsilon'/c)$-test if $D \in \mathcal{C}$.
If not, reject.

• otherwise, agnostically learn a hypothesis $\hat{D}$ for $D$, with $\mathcal{L}$ called with parameters $n$, $\varepsilon'$ and
failure probability 1/6;

• check offline if $\hat{D}$ is $\varepsilon'$-close to $\mathcal{C}_{\rm Hard}$, accept if and only if this is the case.

We condition on both calls (to $\mathcal{T}$ and $\mathcal{L}$) being successful, which overall happens with probability
at least 2/3 by a union bound. The completeness is immediate: if $D \in \mathcal{C}_{\rm Hard} \subseteq \mathcal{C}$, $\mathcal{T}$ accepts, and
the hypothesis $\hat{D}$ satisfies $\|D - \hat{D}\|_1 \le \varepsilon'$. Therefore, $\ell_1(\hat{D}, \mathcal{C}_{\rm Hard}) \le \varepsilon'$, and $\mathcal{T}_{\rm Hard}$ accepts.

For the soundness, we proceed by contrapositive. Suppose $\mathcal{T}_{\rm Hard}$ accepts; it means that each
step was successful. In particular, $\ell_1(D, \mathcal{C}) \le \varepsilon'/c$; so that the hypothesis output by the agnostic
learner satisfies $\|D - \hat{D}\|_1 \le c \cdot \mathrm{OPT} + \varepsilon' \le 2\varepsilon'$. In turn, since the last step passed and by a triangle
inequality we get, as claimed, $\ell_1(D, \mathcal{C}_{\rm Hard}) \le 2\varepsilon' + \ell_1(\hat{D}, \mathcal{C}_{\rm Hard}) \le 3\varepsilon' = \varepsilon$.

Observing that the overall sample complexity is $q_T(n, \varepsilon'/c) + q_L(n, \varepsilon', 1/6) = q_T(n, \varepsilon'/c) + o(q_H(n, \varepsilon'))$
concludes the proof. □
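The three-step reduction in this proof can be sketched as follows. This is only an illustrative skeleton under stated assumptions: the tester, the learner, and the offline distance computation are passed in as black-box callables, and all names (`make_hard_tester`, `tester`, `learner`, `distance_to_hard`, `oracle`) are hypothetical rather than taken from the paper.

```python
from typing import Callable, Sequence

Dist = Sequence[float]

def l1_distance(p: Dist, q: Dist) -> float:
    """l1 distance between two explicitly given distributions on the same domain."""
    return sum(abs(a - b) for a, b in zip(p, q))

def make_hard_tester(tester: Callable, learner: Callable,
                     distance_to_hard: Callable[[Dist], float], c: float):
    """Build a tester for C_Hard from a tester and a semi-agnostic learner for C.

    tester(oracle, n, eps, delta) -> bool   (assumed black box for the class C)
    learner(oracle, n, eps, delta) -> Dist  (assumed semi-agnostic learner, constant c)
    distance_to_hard(hypothesis) -> float   (offline l1 distance to C_Hard)
    """
    def hard_tester(oracle, n: int, eps: float) -> bool:
        eps_prime = eps / 3.0
        # Step 1: (eps'/c)-test membership in C with failure probability 1/6.
        if not tester(oracle, n, eps_prime / c, 1.0 / 6.0):
            return False
        # Step 2: learn a hypothesis with accuracy eps' and failure probability 1/6.
        hypothesis = learner(oracle, n, eps_prime, 1.0 / 6.0)
        # Step 3: offline check, using no further samples.
        return distance_to_hard(hypothesis) <= eps_prime
    return hard_tester
```

Taking `distance_to_hard` to be the $\ell_1$ distance to the uniform distribution recovers the uniformity tester described in the high-level idea; only the first two steps draw samples.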

Taking $\mathcal{C}_{\rm Hard}$ to be the singleton consisting of the uniform distribution, and from the semi-agnostic
learners of [CDSS13, CDSS14a] (each with sample complexity either $\mathrm{poly}(1/\varepsilon)$ or $\mathrm{poly}(\log n, 1/\varepsilon)$),
we obtain the following:⁷

Corollary 1.6. Testing log-concavity, convexity, concavity, MHR, unimodality, t-modality, t-histograms,
and t-piecewise degree-d distributions each require $\Omega(\sqrt{n}/\varepsilon^2)$ samples (the last three for $t = o(\sqrt{n})$
and $t(d+1) = o(\sqrt{n})$, respectively), for any $\varepsilon \ge 1/n^{O(1)}$.

Similarly, we can use another result of [DDS12b] which shows how to agnostically learn Poisson
Binomial Distributions with $\tilde{O}(1/\varepsilon^2)$ samples.⁸ Taking $\mathcal{C}_{\rm Hard}$ to be the single $\mathrm{Bin}(n, 1/2)$ distribution
(along with the testing lower bound of [VV14]), this yields the following:

Corollary 1.7. Testing the classes of Binomial and Poisson Binomial Distributions each require
$\Omega(n^{1/4}/\varepsilon^2)$ samples, for any $\varepsilon \ge 1/n^{O(1)}$.

Finally, we derive a lower bound on testing k-SIIRVs from the agnostic learner of [DDO+13] (which
has sample complexity $\mathrm{poly}(k, 1/\varepsilon)$, independent of $n$):

Corollary 1.8. There exist absolute constants $c > 0$ and $\varepsilon_0 > 0$ such that testing the class of
k-SIIRV distributions requires $\Omega(k^{1/2} n^{1/4})$ samples, for any $k = o(n^c)$ and $\varepsilon \le \varepsilon_0$.

Proof of Corollary 1.8. To prove this result, it is enough by Theorem 6.1 to exhibit a particular k-SIIRV
$S$ such that testing identity to $S$ requires this many samples. Moreover, from [VV14] this last
part amounts to proving that the (truncated) 2/3-norm $\|S^{-\max}_{-\varepsilon_0}\|_{2/3}$ of $S$ is $\Omega(k^{1/2} n^{1/4})$ (for some
small $\varepsilon_0 > 0$). Our hard instance $S$ is defined as the distribution of
$X_1 + \cdots + X_n$, where the $X_i$'s are independent integer random variables uniform on $\{0, \ldots, k-1\}$
(in particular, for $k = 2$ we get a $\mathrm{Bin}(n, 1/2)$ distribution). It is straightforward to verify that
$\mathbb{E}S = \frac{n(k-1)}{2}$ and $\sigma^2 \stackrel{\rm def}{=} \operatorname{Var} S = \frac{(k^2-1)n}{12} = \Theta(k^2 n)$; moreover, $S$ is log-concave (as the convolution
of $n$ uniform distributions). From this last point, we get that (i) the maximum probability of $S$,
attained at its mode, is $\|S\|_\infty = \Theta(1/\sigma)$; and (ii) for every $j$ in an interval $I$ of length $2\sigma$ centered
at this mode, $S(j) \ge \Omega(\|S\|_\infty)$. Putting this together, we get that the 2/3-norm (and similarly the
truncated 2/3-norm) of $S$ is lower bounded by

\[ \left( \sum_{j \in I} S(j)^{2/3} \right)^{3/2} \ge \left( 2\sigma \cdot \Omega(1/\sigma)^{2/3} \right)^{3/2} = \Omega\left(\sigma^{1/2}\right) = \Omega\left(k^{1/2} n^{1/4}\right) \]

which concludes the proof. □

⁷ Specifically, these lower bounds hold as long as $\varepsilon = \Omega(1/n^{\alpha})$ for some absolute constant $\alpha > 0$ (so that the sample
complexity of the agnostic learner is indeed negligible in front of $\sqrt{n}/\varepsilon^2$).

⁸ Note the quasi-quadratic dependence on $\varepsilon$ of the learner, which allows us to get $\varepsilon$ into our lower bound for
$n \ge \mathrm{poly} \log(1/\varepsilon)$.
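The quantities used in this proof are easy to check numerically on small instances. The sketch below (with illustrative function names that are not part of the paper) builds the distribution of a sum of $n$ independent uniform variables on $\{0, \ldots, k-1\}$ by repeated convolution, and exposes its mean, variance, and 2/3-norm so they can be compared against $n(k-1)/2$, $(k^2-1)n/12$, and $\sigma^{1/2}$ respectively.

```python
def uniform_sum_pmf(n, k):
    """pmf of X_1 + ... + X_n with each X_i uniform on {0, ..., k-1}, by convolution."""
    pmf = [1.0]  # distribution of the empty sum
    for _ in range(n):
        nxt = [0.0] * (len(pmf) + k - 1)
        for s, mass in enumerate(pmf):
            for v in range(k):
                nxt[s + v] += mass / k
        pmf = nxt
    return pmf

def mean_and_variance(pmf):
    """Mean and variance of a distribution given as a list of probabilities."""
    mean = sum(j * p for j, p in enumerate(pmf))
    var = sum((j - mean) ** 2 * p for j, p in enumerate(pmf))
    return mean, var

def two_thirds_norm(pmf):
    """The (untruncated) 2/3-quasinorm (sum_j p_j^(2/3))^(3/2)."""
    return sum(p ** (2.0 / 3.0) for p in pmf) ** 1.5
```

On moderate parameters (say $n = 50$, $k = 4$) the maximum probability times $\sigma$ stays bounded away from 0 and infinity, and the 2/3-norm is at least $\sqrt{\sigma}$, in line with the $\Omega(k^{1/2} n^{1/4})$ bound.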

6.2 Tolerant Testing 

The lower bound framework from the previous section carries over to tolerant testing as well, resulting
in this analogue to Theorem 6.1:

Theorem 6.2. Let $\mathcal{C}$ be a class of distributions over $[n]$ for which the following holds:

(i) there exists a semi-agnostic learner $\mathcal{L}$ for $\mathcal{C}$, with sample complexity $q_L(n, \varepsilon, \delta)$ and “agnostic
constant” $c$;

(ii) there exists a subclass $\mathcal{C}_{\rm Hard} \subseteq \mathcal{C}$ such that tolerant testing $\mathcal{C}_{\rm Hard}$ requires $q_H(n, \varepsilon_1, \varepsilon_2)$ samples
for some parameters $\varepsilon_2 > (4c + 1)\varepsilon_1$.

Suppose further that $q_L(n, \varepsilon_2 - \varepsilon_1, 1/10) = o(q_H(n, \varepsilon_1, \varepsilon_2))$. Then, any tolerant tester for $\mathcal{C}$ must
use $\Omega(q_H(n, \varepsilon_1, \varepsilon_2))$ samples (for some explicit parameters $\varepsilon_1', \varepsilon_2'$).

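Proof. The argument follows the same ideas as for Theorem 6.1, up to the details of the parameters.
Assuming $\mathcal{C}$, $\mathcal{C}_{\rm Hard}$, $\mathcal{L}$ as above (with semi-agnostic constant $c \ge 1$), and a tolerant tester $\mathcal{T}$
for $\mathcal{C}$ with sample complexity $q_T(n, \varepsilon_1, \varepsilon_2)$, we define a tolerant tester $\mathcal{T}_{\rm Hard}$ for $\mathcal{C}_{\rm Hard}$. On input
$0 < \varepsilon_1 < \varepsilon_2 \le 1$ with $\varepsilon_2 > (4c + 1)\varepsilon_1$, and given sample access to a distribution $D$ on $[n]$, $\mathcal{T}_{\rm Hard}$
acts as follows. After setting $\varepsilon_1' \stackrel{\rm def}{=} \frac{\varepsilon_2 - \varepsilon_1}{4}$, $\varepsilon_2' \stackrel{\rm def}{=} \frac{\varepsilon_2 - \varepsilon_1}{2}$, $\varepsilon' \stackrel{\rm def}{=} \frac{\varepsilon_2 - \varepsilon_1}{16}$ and $\tau \stackrel{\rm def}{=} \frac{6\varepsilon_2 + 10\varepsilon_1}{16}$,

• call $\mathcal{T}$ with parameters $n$, $\frac{\varepsilon_1'}{c}$, $\frac{\varepsilon_2'}{c}$ and failure probability 1/6, to tolerantly test if $D \in \mathcal{C}$. If
$\ell_1(D, \mathcal{C}) > \varepsilon_2'/c$, reject.

• otherwise, agnostically learn a hypothesis $\hat{D}$ for $D$, with $\mathcal{L}$ called with parameters $n$, $\varepsilon'$ and
failure probability 1/6;

• check offline if $\hat{D}$ is $\tau$-close to $\mathcal{C}_{\rm Hard}$, accept if and only if this is the case.

We condition on both calls (to $\mathcal{T}$ and $\mathcal{L}$) being successful, which overall happens with probability
at least 2/3 by a union bound. We first argue completeness: assume $\ell_1(D, \mathcal{C}_{\rm Hard}) \le \varepsilon_1$. This
implies $\ell_1(D, \mathcal{C}) \le \varepsilon_1$, so that $\mathcal{T}$ accepts as $\varepsilon_1 \le \varepsilon_1'/c$ (which is the case because $\varepsilon_2 > (4c + 1)\varepsilon_1$).
Thus, the hypothesis $\hat{D}$ satisfies $\|D - \hat{D}\|_1 \le c \cdot \varepsilon_1'/c + \varepsilon' = \varepsilon_1' + \varepsilon'$. Therefore, $\ell_1(\hat{D}, \mathcal{C}_{\rm Hard}) \le
\|D - \hat{D}\|_1 + \ell_1(D, \mathcal{C}_{\rm Hard}) \le \varepsilon_1' + \varepsilon' + \varepsilon_1 < \tau$, and $\mathcal{T}_{\rm Hard}$ accepts.

For the soundness, we again proceed by contrapositive. Suppose $\mathcal{T}_{\rm Hard}$ accepts; it means that
each step was successful. In particular, $\ell_1(D, \mathcal{C}) \le \varepsilon_2'/c$; so that the hypothesis output by the
agnostic learner satisfies $\|D - \hat{D}\|_1 \le c \cdot \mathrm{OPT} + \varepsilon' \le \varepsilon_2' + \varepsilon'$. In turn, since the last step passed and
by a triangle inequality we get, as claimed, $\ell_1(D, \mathcal{C}_{\rm Hard}) \le \varepsilon_2' + \varepsilon' + \ell_1(\hat{D}, \mathcal{C}_{\rm Hard}) \le \varepsilon_2' + \varepsilon' + \tau \le \varepsilon_2$.

Observing that the overall sample complexity is $q_T(n, \frac{\varepsilon_1'}{c}, \frac{\varepsilon_2'}{c}) + q_L(n, \varepsilon', \frac{1}{6}) = q_T(n, \frac{\varepsilon_1'}{c}, \frac{\varepsilon_2'}{c}) +
o(q_H(n, \varepsilon_1, \varepsilon_2))$ concludes the proof. □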
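The parameter bookkeeping in this proof can be checked mechanically. The sketch below assumes the settings $\varepsilon_1' = (\varepsilon_2 - \varepsilon_1)/4$, $\varepsilon_2' = (\varepsilon_2 - \varepsilon_1)/2$, $\varepsilon' = (\varepsilon_2 - \varepsilon_1)/16$ and $\tau = (6\varepsilon_2 + 10\varepsilon_1)/16$ (our reading of the proof's parameters, which should be treated as an assumption), and verifies with exact rational arithmetic the three inequalities the argument relies on. The function names are illustrative only.

```python
from fractions import Fraction

def tolerant_reduction_params(eps1, eps2):
    """Parameters of the tolerant reduction (assumed settings, see lead-in)."""
    gap = Fraction(eps2) - Fraction(eps1)
    return {
        "eps1_prime": gap / 4,
        "eps2_prime": gap / 2,
        "eps_prime": gap / 16,
        "tau": (6 * Fraction(eps2) + 10 * Fraction(eps1)) / 16,
    }

def reduction_inequalities_hold(eps1, eps2, c):
    """Check the three inequalities used in the completeness and soundness arguments."""
    eps1, eps2, c = Fraction(eps1), Fraction(eps2), Fraction(c)
    p = tolerant_reduction_params(eps1, eps2)
    # T accepts eps1-close inputs: eps1 <= eps1'/c (equivalent to eps2 >= (4c+1) eps1).
    tester_accepts = eps1 <= p["eps1_prime"] / c
    # Completeness: eps1' + eps' + eps1 <= tau.
    completeness = p["eps1_prime"] + p["eps_prime"] + eps1 <= p["tau"]
    # Soundness: eps2' + eps' + tau <= eps2.
    soundness = p["eps2_prime"] + p["eps_prime"] + p["tau"] <= eps2
    return tester_accepts and completeness and soundness
```

For example, with $c = 2$ the check passes for $\varepsilon_1 = 1/100$, $\varepsilon_2 = 1/10$, and fails once $\varepsilon_2 < (4c+1)\varepsilon_1$, since the first inequality is then violated.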

As before, we instantiate the general theorem to obtain specific lower bounds for tolerant testing
of the classes we covered in this paper. That is, taking $\mathcal{C}_{\rm Hard}$ to be the singleton consisting of the
uniform distribution (combined with the tolerant testing lower bound of [VV10]), and again from
the semi-agnostic learners of [CDSS13, CDSS14a] (each with sample complexity either $\mathrm{poly}(1/\varepsilon)$
or $\mathrm{poly}(\log n, 1/\varepsilon)$), we obtain the following:

Corollary 1.11. Tolerant testing of log-concavity, convexity, concavity, MHR, unimodality, and
t-modality each require $\Omega\left(\frac{1}{\varepsilon_2 - \varepsilon_1} \frac{n}{\log n}\right)$ samples (the latter for $t = o(n)$).

Similarly, we again turn to the class of Poisson Binomial Distributions, for which we can invoke
as before the $\tilde{O}(1/\varepsilon^2)$-sample agnostic learner of [DDS12b]. As before, we would like to choose
for $\mathcal{C}_{\rm Hard}$ the single $\mathrm{Bin}(n, 1/2)$ distribution; however, as no tolerant testing lower bound for this
distribution exists in the literature (to the best of our knowledge), we first need to establish the
lower bound we will rely upon:

Theorem 6.3. There exists an absolute constant $\varepsilon_0 > 0$ such that the following holds. Any algorithm
which, given sampling access to an unknown distribution $D$ on $[n]$ and parameter $\varepsilon \in (0, \varepsilon_0)$,
distinguishes with probability at least 2/3 between (i) $\|D - \mathrm{Bin}(n, 1/2)\|_1 \le \varepsilon$ and (ii)
$\|D - \mathrm{Bin}(n, 1/2)\|_1 \ge 100\varepsilon$ must use samples.

The proof relies on a reduction from tolerant testing of uniformity, drawing on a result of 
Valiant and Valiant [VV10]; for the sake of conciseness, the details are deferred to Appendix D. 
With Theorem 6.3 in hand, we can apply Theorem 6.2 to obtain the desired lower bound: 

Corollary 1.12. Tolerant testing of the classes of Binomial and Poisson Binomial Distributions 
each require samples. 

We observe that both Corollary 1.11 and Corollary 1.12 are tight (with regard to the dependence 
on n), as proven in the next section (Section 7). 

7 A Generic Tolerant Testing Upper Bound 

To conclude this work, we address the question of tolerant testing of distribution classes. In the 
same spirit as before, we focus on describing a generic approach to obtain such bounds, in a clean 
conceptual manner. The most general statement of the result we prove in this section is stated 
below, which we then instantiate to match the lower bounds from Section 6.2: 

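Theorem 7.1. Let $\mathcal{C}$ be a class of distributions over $[n]$ for which the following holds:

(i) there exists a semi-agnostic learner $\mathcal{L}$ for $\mathcal{C}$, with sample complexity $q_L(n, \varepsilon, \delta)$ and “agnostic
constant” $c$;

(ii) for any $\varepsilon \in [0,1]$, every distribution in $\mathcal{C}$ has $\varepsilon$-effective support of size at most $M(n, \varepsilon)$.

Then, there exists an algorithm that, for any fixed $\kappa > 1$ and on input $\varepsilon_1, \varepsilon_2 \in (0,1)$ such that
$\varepsilon_2 \ge C\varepsilon_1$, has the following guarantee (where $C > 2$ depends on $c$ and $\kappa$ only). The algorithm
takes $O\left(\frac{1}{(\varepsilon_2 - \varepsilon_1)^2} \frac{m}{\log m}\right) + q_L\left(n, \frac{\varepsilon_2 - \varepsilon_1}{\kappa}, \frac{1}{10}\right)$ samples (where $m = M(n, \varepsilon_1)$), and with probability at
least 2/3 distinguishes between (a) $\ell_1(D, \mathcal{C}) \le \varepsilon_1$ and (b) $\ell_1(D, \mathcal{C}) > \varepsilon_2$. (Moreover, one can take
$C = 1 + (5c + 6)\frac{\kappa}{\kappa - 1}$.)

Corollary 1.9. Tolerant testing of log-concavity, convexity, concavity, MHR, unimodality, and
t-modality can be performed with $O\left(\frac{1}{(\varepsilon_2 - \varepsilon_1)^2} \frac{n}{\log n}\right)$ samples, for $\varepsilon_2 \ge C\varepsilon_1$ (where $C > 2$ is an absolute
constant).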
Applying now the theorem with $M(n, \varepsilon) = \sqrt{n \log(1/\varepsilon)}$ (as per Corollary 5.5), we obtain an
improved upper bound for Binomial and Poisson Binomial distributions:


Corollary 1.10. Tolerant testing of the classes of Binomial and Poisson Binomial Distributions
can be performed with $O\left(\frac{1}{(\varepsilon_2 - \varepsilon_1)^2} \frac{\sqrt{n \log(1/\varepsilon_1)}}{\log n}\right)$ samples, for $\varepsilon_2 \ge C\varepsilon_1$ (where $C > 2$ is an absolute
constant).


High-level idea. Somewhat similar to the lower bound framework developed in Section 6, the
gist of the approach is to reduce the problem of tolerant testing membership of $D$ to the class $\mathcal{C}$
to that of tolerant testing identity to a known distribution, namely the distribution $\hat{D}$ obtained
after trying to agnostically learn $D$. Intuitively, an agnostic learner for $\mathcal{C}$ should result in a good
enough hypothesis $\hat{D}$ (i.e., $\hat{D}$ close enough to both $D$ and $\mathcal{C}$) when $D$ is $\varepsilon_1$-close to $\mathcal{C}$; but output
a $\hat{D}$ that is significantly far from either $D$ or $\mathcal{C}$ when $D$ is $\varepsilon_2$-far from $\mathcal{C}$, sufficiently for us to
be able to tell. Besides the many technical details one has to control for the parameters to work
out, one key element is the use of a tolerant testing algorithm for closeness of two distributions
due to [VV11b], whose (tight) sample complexity scales as $n/\log n$ for a domain of size $n$. In order
to get the right dependence on the effective support (required in particular for Corollary 1.10), we
have to perform a first test to identify the effective support of the distribution and check its size,
in order to only call this tolerant closeness testing algorithm on this much smaller subset. (This
additional preprocessing step itself has to be carefully done, and comes at the price of a slightly
worse constant $C = C(c, \kappa)$ in the statement of the theorem.)

7.1 Proof of Theorem 7.1 

As described in the preceding section, the algorithm will rely on the ability to perform tolerant 
testing of equivalence between two unknown distributions (over some known domain of size m). 
This is ensured by an algorithm of Valiant and Valiant, restated below: 

Theorem 7.2 ([VV11b, Theorem 3 and 4]). There exists an algorithm $\mathcal{E}$ which, given sampling
access to two unknown distributions $D_1$, $D_2$ over $[m]$, satisfies the following. On input $\varepsilon \in (0,1]$, it
takes $O\left(\frac{1}{\varepsilon^2} \frac{m}{\log m}\right)$ samples from $D_1$ and $D_2$, and outputs a value $\Delta$ such that $\left| \|D_1 - D_2\|_1 - \Delta \right| \le \varepsilon$
with probability $1 - 1/\mathrm{poly}(m)$. (Furthermore, $\mathcal{E}$ runs in time $\mathrm{poly}(m)$.)

For the proof, we will also need this fact, similar to Lemma 5.3, which relates the distance of two 
distributions to that of their conditional distributions on a subset of the domain: 

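Fact 7.3. Let $D$ and $P$ be distributions over $[n]$, and $I \subseteq [n]$ an interval such that $D(I) \ge 1 - \alpha$
and $P(I) \ge 1 - \beta$. Then,

• $\|D_I - P_I\|_1 \le \frac{3}{2} \frac{\|D - P\|_1}{D(I)} \le 3\|D - P\|_1$ (the last inequality for $\alpha \le \frac{1}{2}$); and

• $\|D_I - P_I\|_1 \ge \|D - P\|_1 - 2(\alpha + \beta)$.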
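Proof. To establish the first item, write:

\[
\begin{aligned}
\|D_I - P_I\|_1 &= \sum_{i \in I} \left| \frac{D(i)}{D(I)} - \frac{P(i)}{P(I)} \right|
 = \frac{1}{D(I)} \sum_{i \in I} \left| D(i) - P(i) + P(i)\left(1 - \frac{D(I)}{P(I)}\right) \right| \\
&\le \frac{1}{D(I)} \left( \sum_{i \in I} |D(i) - P(i)| + \left| 1 - \frac{D(I)}{P(I)} \right| \sum_{i \in I} P(i) \right) \\
&= \frac{1}{D(I)} \left( \sum_{i \in I} |D(i) - P(i)| + |P(I) - D(I)| \right)
 \le \frac{1}{D(I)} \left( \sum_{i \in I} |D(i) - P(i)| + \frac{1}{2}\|D - P\|_1 \right) \\
&\le \frac{1}{D(I)} \cdot \frac{3}{2} \|D - P\|_1
\end{aligned}
\]

where we used the fact that $|P(I) - D(I)| \le d_{\rm TV}(D, P) = \frac{1}{2}\|D - P\|_1$. Turning now to the second
item, we have:

\[
\begin{aligned}
\|D_I - P_I\|_1 &= \frac{1}{D(I)} \sum_{i \in I} \left| D(i) - P(i) + P(i)\left(1 - \frac{D(I)}{P(I)}\right) \right|
 \ge \frac{1}{D(I)} \left( \sum_{i \in I} |D(i) - P(i)| - \left| 1 - \frac{D(I)}{P(I)} \right| \sum_{i \in I} P(i) \right) \\
&= \frac{1}{D(I)} \left( \sum_{i \in I} |D(i) - P(i)| - |P(I) - D(I)| \right)
 \ge \frac{1}{D(I)} \left( \sum_{i \in I} |D(i) - P(i)| - (\alpha + \beta) \right) \\
&\ge \frac{1}{D(I)} \left( \|D - P\|_1 - \sum_{i \notin I} |D(i) - P(i)| - (\alpha + \beta) \right)
 \ge \frac{1}{D(I)} \left( \|D - P\|_1 - 2(\alpha + \beta) \right) \\
&\ge \|D - P\|_1 - 2(\alpha + \beta).
\end{aligned}
\]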


□ 
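Both inequalities of Fact 7.3 can be checked numerically on explicit distributions. The sketch below is an illustration under our own naming (`conditional`, `check_fact_7_3`, `random_dist` are hypothetical helpers): it conditions two distributions on an interval and compares the resulting $\ell_1$ distance with the upper bound $\frac{3}{2}\|D - P\|_1 / D(I)$ and the lower bound $\|D - P\|_1 - 2(\alpha + \beta)$, taking $\alpha = 1 - D(I)$ and $\beta = 1 - P(I)$.

```python
import random

def l1_distance(p, q):
    """l1 distance between two distributions given as equal-length lists."""
    return sum(abs(a - b) for a, b in zip(p, q))

def conditional(p, lo, hi):
    """Return (p conditioned on {lo, ..., hi}, mass of that interval)."""
    mass = sum(p[lo:hi + 1])
    return [x / mass for x in p[lo:hi + 1]], mass

def check_fact_7_3(D, P, lo, hi, tol=1e-9):
    """Return (upper_ok, lower_ok) for the two bounds of Fact 7.3 on I = {lo, ..., hi}."""
    D_I, d_mass = conditional(D, lo, hi)
    P_I, p_mass = conditional(P, lo, hi)
    alpha, beta = 1.0 - d_mass, 1.0 - p_mass
    cond_dist = l1_distance(D_I, P_I)
    full_dist = l1_distance(D, P)
    upper_ok = cond_dist <= 1.5 * full_dist / d_mass + tol
    lower_ok = cond_dist >= full_dist - 2.0 * (alpha + beta) - tol
    return upper_ok, lower_ok

def random_dist(n, rng):
    """A random distribution on n points with strictly positive probabilities."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]
```

The small tolerance `tol` only absorbs floating-point rounding; the inequalities themselves hold exactly.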


With these two ingredients, we are in position to establish our theorem: 

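Proof of Theorem 7.1. The algorithm proceeds as follows, where we set $\varepsilon \stackrel{\rm def}{=} \frac{\varepsilon_2 - \varepsilon_1}{17\kappa}$, $\theta \stackrel{\rm def}{=} \varepsilon_2 - ((6 +
c)\varepsilon_1 + 11\varepsilon)$, and $\tau \stackrel{\rm def}{=} \frac{(3 + c)\varepsilon_1 + 5\varepsilon}{2}$:

(1) using $O(\frac{1}{\varepsilon^2})$ samples, get (with probability at least $1 - 1/10$, by Theorem 2.10) a distribution
$\tilde{D}$ $\frac{\varepsilon}{2}$-close to $D$ in Kolmogorov distance; and let $I \subseteq [n]$ be the smallest interval such that
$\tilde{D}(I) > 1 - \frac{3}{2}\varepsilon_1 - \varepsilon$. Output REJECT if $|I| > M(n, \varepsilon_1)$.

(2) invoke $\mathcal{L}$ on $D$ with parameters $\varepsilon$ and failure probability $\frac{1}{10}$, to obtain a hypothesis $\hat{D}$;

(3) call $\mathcal{E}$ (from Theorem 7.2) on $D_I$, $\hat{D}_I$ to get an estimate $\hat{\Delta}$ of $\|D_I - \hat{D}_I\|_1$ accurate to within $\varepsilon$;

(4) output REJECT if $\hat{D}(I) < 1 - \tau$;

(5) compute “offline” (an estimate accurate within $\varepsilon$ of) $\ell_1(\hat{D}, \mathcal{C})$, denoted $\Delta$;

(6) output REJECT if $\Delta + \hat{\Delta} > \theta$, and output ACCEPT otherwise.

The claimed sample complexity is immediate from Steps (2) and (3), along with Theorem 7.2.
Turning to correctness, we condition on both subroutines meeting their guarantee (i.e., $\|D - \hat{D}\|_1 \le
c \cdot \mathrm{OPT} + \varepsilon$ and $\|D_I - \hat{D}_I\|_1 \in [\hat{\Delta} - \varepsilon, \hat{\Delta} + \varepsilon]$), which happens with probability at least $8/10 - 1/\mathrm{poly}(n) \ge
3/4$ by a union bound.

• Completeness: If $\ell_1(D, \mathcal{C}) \le \varepsilon_1$, then $D$ is $\varepsilon_1$-close to some $P \in \mathcal{C}$, for which there exists
an interval $J \subseteq [n]$ of size at most $M(n, \varepsilon_1)$ such that $P(J) \ge 1 - \varepsilon_1$. It follows that
$D(J) \ge 1 - \frac{3}{2}\varepsilon_1$ (since $|D(J) - P(J)| \le \frac{\varepsilon_1}{2}$) and $\tilde{D}(J) \ge 1 - \frac{3}{2}\varepsilon_1 - 2 \cdot \frac{\varepsilon}{2}$; establishing existence
of a good interval $I$ to be found (and Step (1) does not end with REJECT). Additionally,
$\|D - \hat{D}\|_1 \le c \cdot \varepsilon_1 + \varepsilon$ and by the triangle inequality this implies $\ell_1(\hat{D}, \mathcal{C}) \le (1 + c)\varepsilon_1 + \varepsilon$.

Moreover, as $D(I) \ge \tilde{D}(I) - 2 \cdot \frac{\varepsilon}{2} > 1 - \frac{3}{2}\varepsilon_1 - 2\varepsilon$ and $|\hat{D}(I) - D(I)| \le \frac{1}{2}\|\hat{D} - D\|_1$, we do
have

\[ \hat{D}(I) > 1 - \frac{3}{2}\varepsilon_1 - 2\varepsilon - \frac{c\varepsilon_1 + \varepsilon}{2} = 1 - \tau \]

and the algorithm does not reject in Step (4). To conclude, one has by Fact 7.3 that

\[ \|D_I - \hat{D}_I\|_1 \le \frac{3}{2} \frac{\|D - \hat{D}\|_1}{D(I)} \le 3(c\varepsilon_1 + \varepsilon) \qquad (\text{for } \varepsilon_1 < 1/4, \text{ as } \varepsilon < 1/17) \]

Therefore, $\Delta + \hat{\Delta} \le \ell_1(\hat{D}, \mathcal{C}) + \varepsilon + \|D_I - \hat{D}_I\|_1 + \varepsilon \le (4c + 1)\varepsilon_1 + 6\varepsilon \le \varepsilon_2 - ((6 + c)\varepsilon_1 + 11\varepsilon) = \theta$
(the last inequality by the assumption on $\varepsilon_2, \varepsilon_1$), and the tester accepts.

• Soundness: If $\ell_1(D, \mathcal{C}) > \varepsilon_2$, then we must have $\|D - \hat{D}\|_1 + \ell_1(\hat{D}, \mathcal{C}) > \varepsilon_2$. If the algorithm
does not already reject in Step (4), then $\hat{D}(I) \ge 1 - \tau$. But, by Fact 7.3,

\[ \|D_I - \hat{D}_I\|_1 \ge \|D - \hat{D}\|_1 - 2(D(I^c) + \hat{D}(I^c)) \ge \|D - \hat{D}\|_1 - 2\left(\frac{3}{2}\varepsilon_1 + 2\varepsilon + \tau\right) = \|D - \hat{D}\|_1 - ((6 + c)\varepsilon_1 + 9\varepsilon) \]

we then have $\|D_I - \hat{D}_I\|_1 + \ell_1(\hat{D}, \mathcal{C}) > \varepsilon_2 - ((6 + c)\varepsilon_1 + 9\varepsilon)$. This implies $\Delta + \hat{\Delta} > \varepsilon_2 - ((6 +
c)\varepsilon_1 + 9\varepsilon) - 2\varepsilon = \varepsilon_2 - ((6 + c)\varepsilon_1 + 11\varepsilon) = \theta$, and the tester rejects.

Finally, the testing algorithm defined above is computationally efficient as long as both the learning
algorithm (Step (2)) and the estimation procedure (Step (5)) are.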

□ 
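The relation between the constant $C$ and the parameters of the proof can be verified with exact arithmetic. The sketch below assumes the settings $\varepsilon = (\varepsilon_2 - \varepsilon_1)/(17\kappa)$, $\tau = ((3 + c)\varepsilon_1 + 5\varepsilon)/2$ and $\theta = \varepsilon_2 - ((6 + c)\varepsilon_1 + 11\varepsilon)$ (our reading of the proof, to be treated as an assumption), and checks that the acceptance condition $(4c + 1)\varepsilon_1 + 6\varepsilon \le \theta$ holds exactly when $\varepsilon_2 \ge C\varepsilon_1$ with $C = 1 + (5c + 6)\kappa/(\kappa - 1)$. Function names are illustrative.

```python
from fractions import Fraction

def theorem_7_1_params(eps1, eps2, c, kappa):
    """Return (eps, tau, theta, C) under the assumed parameter settings."""
    eps1, eps2 = Fraction(eps1), Fraction(eps2)
    c, kappa = Fraction(c), Fraction(kappa)
    eps = (eps2 - eps1) / (17 * kappa)
    tau = ((3 + c) * eps1 + 5 * eps) / 2
    theta = eps2 - ((6 + c) * eps1 + 11 * eps)
    C = 1 + (5 * c + 6) * kappa / (kappa - 1)
    return eps, tau, theta, C

def accept_threshold_ok(eps1, eps2, c, kappa):
    """Check (4c+1) eps1 + 6 eps <= theta, the last inequality of the accept case."""
    eps, _tau, theta, _C = theorem_7_1_params(eps1, eps2, c, kappa)
    return (4 * Fraction(c) + 1) * Fraction(eps1) + 6 * eps <= theta
```

Indeed, the condition rearranges to $(5c + 7)\varepsilon_1 + (\varepsilon_2 - \varepsilon_1)/\kappa \le \varepsilon_2$, i.e. $\varepsilon_2 \ge \frac{(5c + 7)\kappa - 1}{\kappa - 1}\varepsilon_1 = C\varepsilon_1$.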
A Proof of Lemma 2.9


We now give the proof of Lemma 2.9, restated below:

Lemma 2.9 (Adapted from [DKN15b, Theorem 11]). There exists an algorithm Check-Small-$\ell_2$
which, given parameters $\varepsilon, \delta \in (0,1)$ and $c \cdot \frac{\sqrt{|I|}}{\varepsilon^2} \log(1/\delta)$ independent samples from a distribution
$D$ over $I$ (for some absolute constant $c > 0$), outputs either yes or no, and satisfies the following.

• If $\|D - \mathcal{U}_I\|_2 > \varepsilon/\sqrt{|I|}$, then the algorithm outputs no with probability at least $1 - \delta$;

• If $\|D - \mathcal{U}_I\|_2 \le \varepsilon/(2\sqrt{|I|})$, then the algorithm outputs yes with probability at least $1 - \delta$.

Proof. To do so, we first describe an algorithm that distinguishes between $\|D - \mathcal{U}\|_2^2 \ge \varepsilon^2/n$ and
$\|D - \mathcal{U}\|_2^2 < \varepsilon^2/(4n)$ with probability at least 2/3, using $C \cdot \frac{\sqrt{n}}{\varepsilon^2}$ samples. Boosting the success
probability to $1 - \delta$ at the price of a multiplicative $\log\frac{1}{\delta}$ factor can then be achieved by standard
techniques.

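Similarly as in the proof of Theorem 11 (whose algorithm we use, but with a different threshold $\tau \stackrel{\rm def}{=}
\frac{3}{4} \frac{m^2 \varepsilon^2}{n}$), define the quantities

\[ Z_k \stackrel{\rm def}{=} \left( X_k - \frac{m}{n} \right)^2 - X_k, \qquad k \in [n] \]

and $Z \stackrel{\rm def}{=} \sum_{k=1}^n Z_k$, where the $X_k$'s (and thus the $Z_k$'s) are independent by Poissonization, and
$X_k \sim \mathrm{Poisson}(m D(k))$. It is not hard to see that $\mathbb{E} Z_k = m^2 \Delta_k^2$, where $\Delta_k \stackrel{\rm def}{=} \left( \frac{1}{n} - D(k) \right)$, so that
$\mathbb{E} Z = m^2 \|D - \mathcal{U}\|_2^2$. Furthermore, we also get

\[ \operatorname{Var} Z_k = 2m^2 \left( \frac{1}{n} - \Delta_k \right)^2 + 4m^3 \left( \frac{1}{n} - \Delta_k \right) \Delta_k^2 \]

so that

\[ \operatorname{Var} Z = 2m^2 \left( \sum_{k=1}^n \Delta_k^2 + \frac{1}{n} - 2m \sum_{k=1}^n \Delta_k^3 \right) \qquad (2) \]

(after expanding and since $\sum_{k=1}^n \Delta_k = 0$).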
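To make the statistic concrete, here is a minimal sketch of how $Z$ is computed from (Poissonized) counts, together with a numerical check of the identity $\mathbb{E}[(X - a)^2 - X] = (\lambda - a)^2$ for $X \sim \mathrm{Poisson}(\lambda)$, which underlies $\mathbb{E} Z = m^2 \|D - \mathcal{U}\|_2^2$. The function names are illustrative and the truncation level of the Poisson series is an arbitrary choice.

```python
import math

def z_statistic(counts, m):
    """Z = sum_k ((X_k - m/n)^2 - X_k) for counts X_1, ..., X_n and sample parameter m."""
    n = len(counts)
    return sum((x - m / n) ** 2 - x for x in counts)

def expected_z(D, m):
    """Closed form E[Z] = m^2 * ||D - U||_2^2 under Poissonization."""
    n = len(D)
    return m * m * sum((1.0 / n - d) ** 2 for d in D)

def poisson_mean_of_zk(lam, a, terms=200):
    """Numerically evaluate E[(X - a)^2 - X] for X ~ Poisson(lam) by truncating the series."""
    total = 0.0
    prob = math.exp(-lam)  # Pr[X = 0]
    for x in range(terms):
        total += prob * ((x - a) ** 2 - x)
        prob *= lam / (x + 1)  # Pr[X = x + 1] from Pr[X = x]
    return total
```

The tester then compares `z_statistic(counts, m)` against the threshold $\tau$; for the uniform distribution `expected_z` is 0, and it grows as $m^2$ times the squared $\ell_2$ distance to uniform.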
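Soundness. Almost straight from [DKN15b], but the threshold has changed. Assume $\Delta^2 \stackrel{\rm def}{=}
\|D - \mathcal{U}\|_2^2 \ge \varepsilon^2/n$; we will show that $\Pr[Z < \tau] \le 1/3$. By Chebyshev's inequality, it is sufficient
to show that $\tau \le \mathbb{E}Z - \sqrt{3}\sqrt{\operatorname{Var} Z}$, as

\[ \Pr\left[ \mathbb{E}Z - Z > \sqrt{3}\sqrt{\operatorname{Var} Z} \right] \le 1/3 . \]

As $\tau \le \frac{3}{4}\mathbb{E}Z$, arguing that $\sqrt{3}\sqrt{\operatorname{Var} Z} \le \frac{1}{4}\mathbb{E}Z$ is enough, i.e. that $48 \operatorname{Var} Z \le (\mathbb{E}Z)^2$. From (2), this
is equivalent to showing

\[ \Delta^2 + \frac{1}{n} - 2m \sum_{k=1}^n \Delta_k^3 \le \frac{m^2 \Delta^4}{96} . \]

We bound the LHS term by term.

• As $\Delta^2 \ge \frac{\varepsilon^2}{n}$, we get $m^2 \Delta^2 \ge \frac{C^2}{\varepsilon^2}$, and thus $\frac{m^2 \Delta^4}{288} \ge \frac{C^2}{288 \varepsilon^2} \Delta^2 \ge \Delta^2$ (as $C \ge 17$ and $\varepsilon \le 1$).

• Similarly, $\frac{m^2 \Delta^4}{288} \ge \frac{C^2}{288 \varepsilon^2} \cdot \frac{\varepsilon^2}{n} \ge \frac{1}{n}$.

• Finally, recalling that⁹

\[ \sum_{k=1}^n |\Delta_k|^3 \le \left( \sum_{k=1}^n |\Delta_k|^2 \right)^{3/2} = \Delta^3 \]

we get that $2m \left| \sum_{k=1}^n \Delta_k^3 \right| \le 2m \sum_{k=1}^n |\Delta_k|^3 \le 2m\Delta^3 = \frac{m^2 \Delta^4}{288} \cdot \frac{2 \cdot 288}{m\Delta} \le \frac{m^2 \Delta^4}{288}$, using the fact that
$\frac{m\Delta}{2 \cdot 288} \ge 1$ (by choice of $C \ge 576$).

Overall, the LHS is at most $3 \cdot \frac{m^2 \Delta^4}{288} = \frac{m^2 \Delta^4}{96}$, as claimed.

Completeness. Assume $\Delta^2 = \|D - \mathcal{U}\|_2^2 < \varepsilon^2/(4n)$. We need to show that $\Pr[Z \ge \tau] \le 1/3$.
Chebyshev's inequality implies

\[ \Pr\left[ Z - \mathbb{E}Z > \sqrt{3}\sqrt{\operatorname{Var} Z} \right] \le 1/3 \]

and therefore it is sufficient to show that

\[ \tau \ge \mathbb{E}Z + \sqrt{3}\sqrt{\operatorname{Var} Z} . \]

Recalling the expressions of $\mathbb{E}Z$ and $\operatorname{Var} Z$ from (2), this is tantamount to showing

\[ \frac{3}{4} \frac{m^2 \varepsilon^2}{n} \ge m^2 \Delta^2 + \sqrt{6}\, m \sqrt{ \Delta^2 + \frac{1}{n} - 2m \sum_{k=1}^n \Delta_k^3 } \]

or equivalently

\[ \frac{3}{4} \frac{m}{\sqrt{n}} \varepsilon^2 \ge m\sqrt{n} \Delta^2 + \sqrt{6} \sqrt{ 1 + n\Delta^2 - 2nm \sum_{k=1}^n \Delta_k^3 } . \]

Since $\sqrt{1 + n\Delta^2 - 2nm \sum_{k=1}^n \Delta_k^3} \le \sqrt{1 + n\Delta^2} \le \sqrt{1 + \varepsilon^2/4} \le \sqrt{5/4}$, we get that the second term
is at most $\sqrt{30/4} < 3$. All that remains is to show that $m\sqrt{n}\Delta^2 \le \frac{3}{4} \frac{m}{\sqrt{n}} \varepsilon^2 - 3$. But as $\Delta^2 < \varepsilon^2/(4n)$,
$m\sqrt{n}\Delta^2 < \frac{m}{\sqrt{n}} \frac{\varepsilon^2}{4}$, and our choice of $m \ge C \cdot \frac{\sqrt{n}}{\varepsilon^2}$ for some absolute constant $C \ge 6$ ensures this
holds. □

⁹ For any sequence $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, $p > 0 \mapsto \|x\|_p$ is non-increasing. In particular, for $0 < p \le q < \infty$,

\[ \left( \sum_i |x_i|^q \right)^{1/q} = \|x\|_q \le \|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p} . \]

To see why, one can easily prove that if $\|x\|_p = 1$ then $\|x\|_q^q \le 1$ (bounding each term $|x_i|^q \le |x_i|^p$), and therefore
$\|x\|_q \le 1 = \|x\|_p$. Next, for the general case, apply this to $y = x/\|x\|_p$, which has unit $\ell_p$ norm, and conclude by
homogeneity of the norm.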

B Proof of Theorem 4.5 

In this section, we prove our structural result for MHR distributions, Theorem 4.5:

Theorem 4.5 (Monotone Hazard Rate). For all $\gamma > 0$, the class $\mathcal{MHR}$ of MHR distributions on
$[n]$ is $(\gamma, L)$-decomposable for $L \stackrel{\rm def}{=} O\left(\frac{\log(n/\gamma)}{\gamma}\right)$.

Proof. We reproduce and adapt the argument of [CDSS13, Section 5.1] to meet our definition of
decomposability (which, albeit related, is incomparable to theirs). First, we modify the algorithm
at the core of their constructive proof, in Algorithm 3: note that the only two changes are in
Steps 2 and 3, where we use parameters respectively $\frac{\gamma}{n}$ and $\frac{\gamma}{n^2}$.

Algorithm 3 Decompose-MHR′(D, γ)

Require: explicit description of MHR distribution $D$ over $[n]$; accuracy parameter $\gamma > 0$
1: Set $J \leftarrow [n]$ and $Q \leftarrow \emptyset$.
2: Let $I \leftarrow$ Right-Interval$(D, J, \frac{\gamma}{n})$ and $I' \leftarrow$ Right-Interval$(D, J \setminus I, \frac{\gamma}{n})$. Set $J \leftarrow J \setminus (I \cup I')$.
3: Set $i \in J$ to be the smallest integer such that $D(i) \ge \frac{\gamma}{n^2}$. If no such $i$ exists, let $I'' \leftarrow J$ and go
to Step 9. Otherwise, let $I'' \leftarrow \{1, \ldots, i - 1\}$ and $J \leftarrow J \setminus I''$.
4: while $J \ne \emptyset$ do
5: Let $j \in J$ be the smallest integer such that $D(j) \notin [\frac{1}{1+\gamma}, 1 + \gamma] D(i)$. If no such $j$ exists, let
$I''' \leftarrow J$; otherwise let $I''' \leftarrow \{i, \ldots, j - 1\}$.
6: Add $I'''$ to $Q$ and set $J \leftarrow J \setminus I'''$.
7: Let $i \leftarrow j$.
8: end while
9: Return $Q \cup \{I, I', I''\}$

Following the structure of their proof, we write $Q = \{I_1, \ldots, I_{|Q|}\}$ with $I_i = [a_i, b_i]$, and define
$Q' = \{ I_i \in Q : D(a_i) > D(a_{i+1}) \}$, $Q'' = \{ I_i \in Q : D(a_i) \le D(a_{i+1}) \}$.

We immediately obtain the analogues of their Lemmas 5.2 and 5.3: 

Lemma B.l. We have Ui ie Q '- T 

Lemma B.2. Step 4 of Algorithm 3 adds at most $O\left(\frac{1}{\gamma} \log\frac{n}{\gamma}\right)$ intervals to $Q$.

Sketch. This derives from observing that now D(I U P) > 7 /n, which as in [CDSS13, Lemma 5.3] 
in turn implies 

l > 2(i + 7 )is'l-i 

n 

so that | Q!\ = O log . 
Again following their argument, we also get


^( a |g|+i) 
D(a i) 


n 

Leg" 


D{aj+ 1 ) 
D{ai) 


n 

Leg' 


D ( a i + 1) 
D ( cii ) 


by combining Lemma B.l with the fact that D(a|g | +1 < 1 and that by construction D{di) > 7 /n 2 , 
we get 

t r D(ai+ 1 ) < n rr_ _ 
i}} Q „ D{ ai ) - 7 7 ~~ 7 ' 

But since each term in the product is at least (1 + 7 ) (by construction of Q and the definition of 
Q"), this leads to 

, n 3 

(1 + 7) |S "'<- 
7 

and thus \Q!'\ = log j) as well. □ 

It remains to show that $Q \cup \{I, I', I''\}$ is indeed a good decomposition of $[n]$ for $D$, as per Definition 3.1.
Since by construction every interval in $Q$ satisfies item (ii), we are only left with the case of $I$, $I'$ and
$I''$. For the first two, as they were returned by Right-Interval, either (a) they are singletons, in
which case item (ii) trivially holds; or (b) they have at least two elements, in which case they have
probability mass at most $\gamma$ (by the choice of parameters for Right-Interval) and thus item (i) is
satisfied. Finally, it is immediate to see that by construction $D(I'') \le n \cdot \gamma/n^2 = \gamma/n$, and item (i)
holds in this case as well. □


C Proofs from Section 4 

This section contains the proofs omitted from Section 4, namely the distance estimation procedures 
for t-piecewise degree-d (Theorem 4.13), monotone hazard rate (Lemma 4.14), and log-concave 
distributions (Lemma 4.15). 

C.l Proof of Theorem 4.13 

In this section, we prove the following: 

Theorem C.1. Let $p$ be an $\ell$-histogram over $[-1,1)$. There is an algorithm ProjectSinglePoly$(d, \varepsilon)$
which runs in time $\mathrm{poly}(\ell, d+1, 1/\varepsilon)$, and outputs a degree-$d$ polynomial $q$ which defines a pdf over
$[-1,1)$ such that $\|p - q\|_1 \le 3\ell_1(p, \mathcal{P}_d) + O(\varepsilon)$.

As mentioned in Section 4, the proof of this statement is a rather straightforward adaptation of
the proof of [CDSS14a, Theorem 9], with two differences: first, in our setting there is no uncertainty
nor probabilistic argument due to sampling, as we are provided with an explicit description of the
histogram $p$. Second, Chan et al. require some “well-behavedness” assumption on the distribution
$p$ (for technical reasons essentially due to the sampling access), that we remove here. Besides these
two points, the proof is almost identical to theirs, and we only reproduce (our modification of)
it here for the sake of completeness. (Any error introduced in the process, however, is solely our
responsibility.)
Proof. Some preliminary definitions will be helpful:

Definition C.2 (Uniform partition). Let $p$ be a subdistribution on an interval $I \subseteq [-1,1)$. A
partition $\mathcal{I} = \{I_1, \ldots, I_\ell\}$ of $I$ is $(p, \eta)$-uniform if $p(I_j) \le \eta$ for all $1 \le j \le \ell$.

We will also use the following notation: For this subsection, let $I = [-1,1)$ ($I$ will denote a
subinterval of $[-1,1)$ when the results are applied in the next subsection). We write $\|f\|_1^{(I)}$ to
denote $\int_I |f(x)|\,dx$, and we write $d_{\rm TV}^{(I)}(p, q)$ to denote $\|p - q\|_1^{(I)}/2$. We write $\mathrm{OPT}_{1,d}^{(I)}$ to denote the
infimum of the distance $\|p - g\|_1^{(I)}$ between $p$ and any degree-$d$ subdistribution $g$ on $I$ that satisfies
$g(I) = p(I)$.

The key step of ProjectSinglePoly is Step 2 where it calls the FindSinglePoly procedure.
In this procedure $T_i(x)$ denotes the degree-$i$ Chebychev polynomial of the first kind. The function
$F$ should be thought of as the CDF of a “quasi-distribution” $f$; we say that $f = F'$
is a “quasi-distribution” and not a bona fide probability distribution because it is not guaranteed to
be non-negative everywhere on $[-1,1)$. Step 2 of FindSinglePoly processes $f$ slightly to obtain
a polynomial $q$ which is an actual distribution over $[-1,1)$.

Algorithm 4 ProjectSinglePoly

Require: parameters $d$, $\varepsilon$; and the full description of an $\ell$-histogram $p$ over $[-1,1)$.

Ensure: a degree-$d$ distribution $q$ such that $d_{\rm TV}(p, q) \le 3 \cdot \mathrm{OPT}_{1,d} + O(\varepsilon)$

1: Partition $[-1,1)$ into $z = \Theta((d+1)/\varepsilon)$ intervals $I_0 = [i_0, i_1), \ldots, I_{z-1} = [i_{z-1}, i_z)$, where $i_0 = -1$
and $i_z = 1$, such that for each $j \in \{0, \ldots, z-1\}$ we have $p(I_j) = \Theta(\varepsilon/(d+1))$ or ($|I_j| = 1$ and
$p(I_j) = \Omega(\varepsilon/(d+1))$).

2: Call FindSinglePoly$(d, \varepsilon, \eta := \Theta(\varepsilon/(d+1)), \{I_0, \ldots, I_{z-1}\}, p)$ and output the hypothesis $q$
that it returns.


The rest of this subsection gives the proof of Theorem C.1. The claimed running time bound
is obvious (the computation is dominated by solving the poly($d$, $1/\varepsilon$)-size LP in ProjectSinglePoly,
with an additional term linear in $\ell$ when partitioning $[-1,1)$ in the initial step), so it
suffices to prove correctness.

Before launching into the proof we give some intuition for the linear program. Intuitively $F(x)$
represents the cdf of a degree-$d$ polynomial distribution $f$ where $f = F'$. Constraint (a) captures
the endpoint constraints that any cdf must obey if it has the same total weight as $p$. Intuitively,
constraint (b) ensures that for each interval $[i_j,i_k)$, the value $F(i_k)-F(i_j)$ (which we may alternately
write as $f([i_j,i_k))$) is close to the weight $p([i_j,i_k))$ that the distribution puts on the interval. Recall
that by assumption $p$ is $\mathrm{OPT}_{1,d}$-close to some degree-$d$ polynomial $r$. Intuitively the variable $w_\ell$
represents $\int_{[i_\ell,i_{\ell+1})}(r-p)$ (note that these values sum to zero by constraint (c)(4)), and $y_\ell$ represents
the absolute value of $w_\ell$ (see constraint (c)(5)). The value $\tau$, which by constraint (c)(6) is at least
the sum of the $y_\ell$'s, represents a lower bound on $\mathrm{OPT}_{1,d}$. The constraints in (d) and (e) reflect
the fact that as a cdf, $F$ should be bounded between 0 and 1 (more on this below), and the (f)
constraints reflect the fact that the pdf $f = F'$ should be everywhere nonnegative (again more on
this below).

We begin by observing that ProjectSinglePoly calls FindSinglePoly with input parameters
that satisfy FindSinglePoly’s input requirements:

(I) the non-singleton intervals $I_0, \dots, I_{z-1}$ are $(p,\eta)$-uniform; and

Algorithm 5 FindSinglePoly

Require: degree parameter $d$; error parameter $\varepsilon$; parameter $\eta$; $(p,\eta)$-uniform partition $\mathcal{I}_I = \{I_1, \dots, I_z\}$
of interval $I$ into $z$ intervals such that $\sqrt{\varepsilon z}\cdot\eta \le \varepsilon/2$; a subdistribution $p$ on $I$

Ensure: a number $\tau$ and a degree-$d$ subdistribution $q$ on $I$ such that $q(I) = p(I)$,

$\mathrm{OPT}_{1,d}^{(I)} \le \|p-q\|_1^{(I)} \le 3\,\mathrm{OPT}_{1,d}^{(I)} + \sqrt{\varepsilon z(d+1)}\cdot\eta + \mathrm{error}$,

$0 \le \tau \le \mathrm{OPT}_{1,d}^{(I)}$ and $\mathrm{error} = O((d+1)\eta)$.

1: Let $\tau$ be the solution to the following LP:

minimize $\tau$ subject to the following constraints:

(Below $F(x) = \sum_{i=0}^{d+1} c_i T_i(x)$ where $T_i(x)$ is the degree-$i$ Chebychev polynomial of the first kind,
and $f(x) = F'(x) = \sum_{i=0}^{d+1} c_i T_i'(x)$.)

(a) $F(-1) = 0$ and $F(1) = p(I)$;

(b) For each $0 \le j < k \le z$,

$\left| p([i_j,i_k)) + \sum_{j \le \ell < k} w_\ell - \big(F(i_k) - F(i_j)\big) \right| \le \sqrt{\varepsilon\cdot(k-j)}\cdot\eta$;

(c) $\sum_{0 \le \ell < z} w_\ell = 0$, (4)

$-y_\ell \le w_\ell \le y_\ell$ for all $0 \le \ell < z$, (5)

$\sum_{0 \le \ell < z} y_\ell \le \tau$; (6)

(d) The constraints $|c_i| \le \sqrt{2}$ for $i = 0, \dots, d+1$;

(e) The constraints

$0 \le F(x) \le 1$ for all $x \in J$,

where $J$ is a set of $O((d+1)^6)$ equally spaced points across $[-1,1)$;

(f) The constraints

$\sum_{i=0}^{d} c_i T_i'(x) \ge 0$ for all $x \in K$,

where $K$ is a set of $O((d+1)^2/\varepsilon)$ equally spaced points across $[-1,1)$.

2: Define $q(x) = \varepsilon f(I)/|I| + (1-\varepsilon) f(x)$. Output $q$ as the hypothesis pdf.





(II) the singleton intervals each have weight $\Omega(\eta)$ (as guaranteed by Step 1 of ProjectSinglePoly).

We then proceed to show that, from there, FindSinglePoly’s LP is feasible and has a high- 
quality optimal solution. 

Lemma C.3. Suppose $p$ is an $\ell$-histogram over $[-1,1)$, so that conditions (I) and (II) above hold;
then the LP defined in Step 1 of FindSinglePoly is feasible; and the optimal solution $\tau$ is at most
$\mathrm{OPT}_{1,d}$.

Proof. As above, let $r$ be a degree-$d$ polynomial pdf such that $\mathrm{OPT}_{1,d} = \|p-r\|_1$ and $r(I) = p(I)$. We
exhibit a feasible solution as follows: take $F$ to be the cdf of $r$ (a degree $d$ polynomial). Take $w_\ell$
to be $\int_{[i_\ell,i_{\ell+1})}(r-p)$, and take $y_\ell$ to be $|w_\ell|$. Finally, take $\tau$ to be $\sum_{0\le\ell<z} y_\ell$.

We first argue feasibility of the above solution. We first take care of the easy constraints: since 
F is the cdf of a subdistribution over I it is clear that constraints (a) and (e) are satisfied, and 
since both r and p are pdfs with the same total weight it is clear that constraints (c)(4) and (f) 
are both satisfied. Constraints (c)(5) and (c)( 6 ) also hold. So it remains to argue constraints (b) 
and (d). 

Note that constraint (b) is equivalent to $p + (r-p) = r$ and $r$ satisfying $(\mathcal{I}, \varepsilon/(d+1), \varepsilon)$-inequalities,
therefore this constraint is satisfied.

To see that constraint (d) is satisfied we recall some of the analysis of Arora and Khot [AK03,
Section 3]. This analysis shows that since $F$ is a cumulative distribution function (and in particular
a function bounded between 0 and 1 on $I$) each of its Chebychev coefficients is at most $\sqrt{2}$ in
magnitude.

To conclude the proof of the lemma we need to argue that $\tau \le \mathrm{OPT}_{1,d}$. Since $w_\ell = \int_{[i_\ell,i_{\ell+1})}(r-p)$
it is easy to see that $\tau = \sum_{0\le\ell<z} y_\ell = \sum_{0\le\ell<z} |w_\ell| \le \|p-r\|_1$, and hence indeed $\tau \le \mathrm{OPT}_{1,d}$ as
required. □

Having established that with high probability the LP is indeed feasible, henceforth we let $\tau$
denote the optimal solution to the LP and $F, f, w_\ell, c_i, y_\ell$ denote the values in the optimal solution.
A simple argument (see e.g. the proof of [AK03, Theorem 8]) gives that $\|F\|_\infty \le 2$. Given this
bound on $\|F\|_\infty$, the Bernstein-Markov inequality implies that $\|f\|_\infty = \|F'\|_\infty \le O((d+1)^2)$.
Together with (f) this implies that $f(z) \ge -\varepsilon/2$ for all $z \in [-1,1)$. Consequently $q(z) \ge 0$ for all
$z \in [-1,1)$, and

$\int q(x)\,dx = \varepsilon + (1-\varepsilon)\int f(x)\,dx = \varepsilon + (1-\varepsilon)(F(1) - F(-1)) = 1.$

So $q(x)$ is indeed a degree-$d$ pdf. To prove Theorem C.1 it remains to show that $\|p-q\|_1 \le 3\,\mathrm{OPT}_{1,d} + O(\varepsilon)$.

We sketch the argument that we shall use to bound $\|p-q\|_1$. A key step in achieving this
bound is to bound the $\|\cdot\|_{\mathcal{A}}$ distance between $f$ and $p + w$ where $\mathcal{A} = \mathcal{A}_{d+1}$ is the class of all
unions of $d+1$ intervals and $w$ is a function based on the $w_\ell$ values (see (9) below). If we can
bound $\|(p+w) - f\|_{\mathcal{A}} \le O(\varepsilon)$ then it will not be difficult to show that $\|r-f\|_{\mathcal{A}} \le \mathrm{OPT}_{1,d} + O(\varepsilon)$.
Since $r$ and $f$ are both degree-$d$ polynomials we have $\|r-f\|_1 = 2\|r-f\|_{\mathcal{A}} \le 2\,\mathrm{OPT}_{1,d} + O(\varepsilon)$, so
the triangle inequality (recalling that $\|p-r\|_1 = \mathrm{OPT}_{1,d}$) gives $\|p-f\|_1 \le 3\,\mathrm{OPT}_{1,d} + O(\varepsilon)$. From
this point a simple argument (Proposition C.5) gives that $\|p-q\|_1 \le \|p-f\|_1 + O(\varepsilon)$, which gives
the theorem.

We will use the following lemma that translates $(\mathcal{I},\eta,\varepsilon)$-inequalities into a bound on $\mathcal{A}_{d+1}$
distance.
Lemma C.4. Let $\mathcal{I} = \{I_0 = [i_0,i_1), \dots, I_{z-1} = [i_{z-1},i_z)\}$ be a $(p,\eta)$-uniform partition of $I$,
possibly augmented with singleton intervals. If $h : I \to \mathbb{R}$ and $p$ satisfy the $(\mathcal{I},\eta,\varepsilon)$-inequalities,
then

$\|p-h\|_{\mathcal{A}_{d+1}} \le \sqrt{\varepsilon z(d+1)}\cdot\eta + \mathrm{error}$,

where $\mathrm{error} = O((d+1)\eta)$.

Proof. To analyze $\|p-h\|_{\mathcal{A}_{d+1}}$, consider any union of $d+1$ disjoint non-overlapping intervals
$S = J_1 \cup \dots \cup J_{d+1}$. We will bound $\|p-h\|_{\mathcal{A}_{d+1}}$ by bounding $|p(S) - h(S)|$.

We lengthen intervals in $S$ slightly to obtain $T = J'_1 \cup \dots \cup J'_{d+1}$ so that each $J'_j$ is a union of intervals
of the form $[i_\ell, i_{\ell+1})$. Formally, if $J_j = [a,b)$, then $J'_j = [a',b')$, where $a' = \max_\ell \{ i_\ell : i_\ell \le a \}$
and $b' = \min_\ell \{ i_\ell : i_\ell \ge b \}$. We claim that

$|p(S) - h(S)| \le O((d+1)\eta) + |p(T) - h(T)|$. (7)

Indeed, consider any interval of the form $J = [i_\ell, i_{\ell+1})$ such that $J \cap S \ne J \cap T$ (in particular, such
an interval cannot be one of the singletons). We have

$|p(J \cap S) - p(J \cap T)| \le p(J) \le O(\eta)$, (8)

where the first inequality uses non-negativity of $p$ and the second inequality follows from the bound
$p([i_\ell, i_{\ell+1})) \le \eta$. The $(\mathcal{I},\eta,\varepsilon)$-inequalities (between $h$ and $p$) imply that the inequalities in (8)
also hold with $h$ in place of $p$. Now (7) follows by adding (8) across all $J = [i_\ell, i_{\ell+1})$ such that
$J \cap S \ne J \cap T$ (there are at most $2(d+1)$ such intervals $J$), since each interval $J_j$ in $S$ can change
at most two such $J$'s when lengthened.

Now rewrite $T$ as a disjoint union of $s \le d+1$ intervals $[i_{L_1}, i_{R_1}) \cup \dots \cup [i_{L_s}, i_{R_s})$. We have

$|p(T) - h(T)| \le \sum_{j=1}^{s} \sqrt{R_j - L_j}\cdot\sqrt{\varepsilon}\,\eta$

by $(\mathcal{I},\eta,\varepsilon)$-inequalities between $p$ and $h$. Now observing that $0 \le L_1 \le R_1 \le \dots \le L_s \le R_s \le z = O((d+1)/\varepsilon)$,
we get that the largest possible value of $\sum_{j=1}^{s} \sqrt{R_j - L_j}$ is $\sqrt{sz} \le \sqrt{(d+1)z}$, so
the RHS of (7) is at most $O((d+1)\eta) + \sqrt{(d+1)z\varepsilon}\cdot\eta$, as desired. □
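
For completeness, the last step of the proof of Lemma C.4 can be spelled out as follows. This is our own expansion of the argument (an application of the Cauchy-Schwarz inequality, which the text leaves implicit), using only that the $s$ intervals $[i_{L_j}, i_{R_j})$ are disjoint and contained in a partition with $z$ pieces:

```latex
\sum_{j=1}^{s} \sqrt{R_j - L_j}
  \;\le\; \sqrt{s \sum_{j=1}^{s} (R_j - L_j)}   % Cauchy-Schwarz
  \;\le\; \sqrt{s z}                            % disjointness: \sum_j (R_j - L_j) \le z
  \;\le\; \sqrt{(d+1) z},                       % s \le d+1
\qquad\text{hence}\qquad
\sum_{j=1}^{s} \sqrt{R_j - L_j}\cdot\sqrt{\varepsilon}\,\eta
  \;\le\; \sqrt{\varepsilon z (d+1)}\cdot\eta .
```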

Recall from above that $F, f, w_\ell, c_i, y_\ell, \tau$ denote the values in the optimal solution. We claim
that

$\|(p+w) - f\|_{\mathcal{A}} = O(\varepsilon)$, (9)

where $w$ is the subdistribution which is constant on each $[i_\ell, i_{\ell+1})$ and has weight $w_\ell$ there, so
in particular $\|w\|_1 \le \tau \le \mathrm{OPT}_{1,d}$. Indeed, this equality follows by applying Lemma C.4 with
$h = f - w$. The lemma requires $h$ and $p$ to satisfy $(\mathcal{I},\eta,\varepsilon)$-inequalities, which follows from
constraint (b) ($(\mathcal{I},\eta,\varepsilon)$-inequalities between $p+w$ and $f$) and observing that $(p+w) - f = p - (f-w)$.
We have also used $\eta = \Theta(\varepsilon/(d+1))$ to bound the error term of the lemma by $O(\varepsilon)$.

Next, by the triangle inequality we have (writing $\mathcal{A}$ for $\mathcal{A}_{d+1}$)

$\|r-f\|_{\mathcal{A}} \le \|r-(p+w)\|_{\mathcal{A}} + \|(p+w)-f\|_{\mathcal{A}}$.

The last term on the RHS has just been shown to be $O(\varepsilon)$. The first term is bounded by

$\|r-(p+w)\|_{\mathcal{A}} \le \frac{1}{2}\|r-(p+w)\|_1 \le \frac{1}{2}\left(\|r-p\|_1 + \|w\|_1\right) \le \mathrm{OPT}_{1,d}$.

Altogether, we get that $\|r-f\|_{\mathcal{A}} \le \mathrm{OPT}_{1,d} + O(\varepsilon)$.

Since $r$ and $f$ are degree $d$ polynomials, $\|r-f\|_1 = 2\|r-f\|_{\mathcal{A}} \le 2\,\mathrm{OPT}_{1,d} + O(\varepsilon)$. This implies
$\|p-f\|_1 \le \|p-r\|_1 + \|r-f\|_1 \le 3\,\mathrm{OPT}_{1,d} + O(\varepsilon)$. Finally, we turn our quasidistribution $f$ which
has value $\ge -\varepsilon/2$ everywhere into a distribution $q$ (which is nonnegative), by redistributing the
weight. The following simple proposition bounds the error incurred.

Proposition C.5. Let $f$ and $p$ be any sub-quasidistribution on $I$. If $q = \varepsilon f(I)/|I| + (1-\varepsilon)f$,
then $\|q-p\|_1 \le \|f-p\|_1 + \varepsilon(f(I) + p(I))$.

Proof. We have

$q - p = \varepsilon\left(f(I)/|I| - p\right) + (1-\varepsilon)(f-p)$.

Therefore

$\|q-p\|_1 \le \varepsilon\left\|f(I)/|I| - p\right\|_1 + (1-\varepsilon)\|f-p\|_1 \le \varepsilon(f(I) + p(I)) + \|f-p\|_1$. □

We now have $\|p-q\|_1 \le \|p-f\|_1 + O(\varepsilon)$ by Proposition C.5, concluding the proof of Theorem C.1.

□ 
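
To illustrate Proposition C.5, the following minimal sketch checks the mixing step numerically on a finite grid. It is only a discrete analogue (counting measure on five points instead of Lebesgue measure on $[-1,1)$); the function names and the example vectors are ours, chosen for illustration.

```python
def mix_with_uniform(f, eps):
    """Return q = eps * f(I)/|I| + (1 - eps) * f, for f given as a list over a finite grid I."""
    total, m = sum(f), len(f)
    return [eps * total / m + (1 - eps) * v for v in f]


def l1(u, v):
    """L1 distance between two vectors of the same length."""
    return sum(abs(a - b) for a, b in zip(u, v))


eps = 0.1
# A "quasi-distribution": total weight 1, but slightly negative at one point.
f = [0.30, 0.25, -0.01, 0.26, 0.20]
# A genuine distribution to compare against.
p = [0.28, 0.24, 0.02, 0.26, 0.20]

q = mix_with_uniform(f, eps)
bound = l1(f, p) + eps * (sum(f) + sum(p))   # right-hand side of Proposition C.5
print(min(q) >= 0, l1(q, p) <= bound)
```

Mixing with the uniform weight lifts the small negative value above zero while preserving the total weight, at an additive cost of at most $\varepsilon(f(I) + p(I))$ in $\ell_1$ distance.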


C.2 Proof of Lemma 4.14 

Lemma 4.14 (Monotone Hazard Rate). There exists a procedure ProjectionDist$_{\mathcal{MHR}}$ that, on
input $n$ as well as the full specification of a $k$-histogram distribution $D$ on $[n]$ and of an $\ell$-histogram
distribution $D'$ on $[n]$, runs in time poly($n$, $1/\varepsilon$), and satisfies the following.

• If there is $P \in \mathcal{MHR}$ such that $\|D-P\|_1 \le \varepsilon$ and $\|D'-P\|_{\mathrm{Kol}} \le \varepsilon^3$, then the procedure
returns yes;

• If $\ell_1(D, \mathcal{MHR}) \ge 100\varepsilon$, then the procedure returns no.

Proof. For convenience, let $\alpha := \varepsilon^3$; we also write $[i,j]$ instead of $\{i,\dots,j\}$.

First, we note that it is easy to reduce our problem to the case where, in the completeness case,
we have $P \in \mathcal{MHR}$ such that $\|D-P\|_1 \le 2\varepsilon$ and $\|D-P\|_{\mathrm{Kol}} \le 2\alpha$; while in the soundness case
$\ell_1(D, \mathcal{MHR}) \ge 99\varepsilon$. Indeed, this can be done with a linear program on poly($k$, $\ell$) variables, asking
to find a $(k+\ell)$-histogram $D''$ on a refinement of $D$ and $D'$ minimizing the $\ell_1$ distance to $D$, under
the constraint that the Kolmogorov distance to $D'$ be bounded by $\varepsilon$. (In the completeness case,
clearly a feasible solution exists, as $P$ is one.) We therefore follow with this new formulation: either

(a) $D$ is $\varepsilon$-close to a monotone hazard rate distribution $P$ (in $\ell_1$ distance) and $D$ is $\alpha$-close to $P$
(in Kolmogorov distance); or

(b) $D$ is $32\varepsilon$-far from monotone hazard rate,
where $D$ is a $(k+\ell)$-histogram.

We then proceed by observing the following easy fact: suppose $P$ is a MHR distribution on $[n]$,
i.e. such that the quantity $h_i := \frac{P(i)}{\sum_{j=i}^{n} P(j)}$, $i \in [n]$, is non-decreasing. Then, we have

$P(i) = h_i \prod_{j=1}^{i-1} (1 - h_j), \quad i \in [n],$ (10)

and there is a bijective correspondence between $P$ and $(h_i)_{i\in[n]}$.
We will write a linear program with variables $y_1, \dots, y_n$, with the correspondence $y_i := \ln(1-h_i)$.
Note that with this parameterization, we get that if the $(y_i)_{i\in[n]}$ correspond to a MHR distribution
$P$, then for $i \in [n]$

$P([i,n]) = \prod_{j=1}^{i-1} e^{y_j} = e^{\sum_{j=1}^{i-1} y_j}$

and asking that $\ln(1-\varepsilon) \le \sum_{j=1}^{i-1} y_j - \ln D([i,n]) \le \ln(1+\varepsilon)$ amounts to requiring

$P([i,n]) \in [1 \pm \varepsilon]\, D([i,n])$.
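
The correspondence between a distribution and its hazard rates, and the effect of the change of variables $y_i = \ln(1-h_i)$, can be checked on a small example. The sketch below is ours (function names and the example distribution are illustrative assumptions); it uses exact rationals for the round trip and floating point only for the logarithms.

```python
from fractions import Fraction
import math


def hazard_rates(P):
    """h_i = P(i) / P([i, n]) for a pmf P (list index 0 is the point 1); assumes P(n) > 0."""
    h, tail = [], sum(P)
    for p_i in P:
        h.append(p_i / tail)
        tail -= p_i
    return h


def from_hazard_rates(h):
    """Invert the map: P(i) = h_i * prod_{j < i} (1 - h_j), as in Equation (10)."""
    P, survive = [], Fraction(1)
    for h_i in h:
        P.append(h_i * survive)
        survive *= 1 - h_i
    return P


P = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
h = hazard_rates(P)                       # 1/10, 2/9, 3/7, 1 (non-decreasing)
y = [math.log(float(1 - h_i)) for h_i in h[:-1]]   # y_n is omitted since h_n = 1

# Tail weights are exponentials of partial sums of the y_j: P([i, n]) = exp(sum_{j < i} y_j).
tails = [math.exp(sum(y[:i])) for i in range(len(P))]
print(h, tails)
```

Because the tail weights are products of the $(1-h_j)$, a multiplicative constraint on $P([i,n])$ becomes a linear constraint on partial sums of the $y_j$.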


We focus first on the completeness case, to provide intuition for the linear program. Suppose
there exists $P \in \mathcal{MHR}$ such that $\|D-P\|_1 \le \varepsilon$ and $\|D'-P\|_{\mathrm{Kol}} \le \alpha$. This
implies that for all $i \in [n]$, $|P([i,n]) - D([i,n])| \le 2\alpha$. Define $I = \{b+1,\dots,n\}$ to be the longest
interval such that $D(\{b+1,\dots,n\}) \le \varepsilon/2$. It follows that for every $i \in [n] \setminus I$,

$\frac{P([i,n])}{D([i,n])} \le \frac{D([i,n]) + 2\alpha}{D([i,n])} \le 1 + \frac{2\alpha}{\varepsilon/2} = 1 + 4\varepsilon^2 \le 1 + \varepsilon$ (11)

and similarly $\frac{P([i,n])}{D([i,n])} \ge \frac{D([i,n]) - 2\alpha}{D([i,n])} \ge 1 - \varepsilon$. This means that for the points $i$ in $[n] \setminus I$, we can write
constraints asking for multiplicative closeness (within $1 \pm \varepsilon$) between $e^{\sum_{j=1}^{i-1} y_j}$ and $D([i,n])$, which
is very easy to write down as linear constraints on the $y_j$'s.


The linear program. Let $T$ and $S$ be respectively the sets of “light” and “heavy” points, defined
as $T = \{ i \in \{1,\dots,b\} : D(i) \le \varepsilon^2 \}$ and $S = \{ i \in \{1,\dots,b\} : D(i) > \varepsilon^2 \}$, where $b$ is as above.
(In particular, $|S| \le 1/\varepsilon^2$.)

Algorithm 6 Linear Program

Find $y_1, \dots, y_b$

s.t.

$y_i \le 0$ (12)

$y_{i+1} \le y_i$, $\forall i \in \{1,\dots,b-1\}$ (13)

$\ln(1-\varepsilon) \le \sum_{j=1}^{i-1} y_j - \ln D([i,n]) \le \ln(1+\varepsilon)$, $\forall i \in \{1,\dots,b\}$ (14)

$\frac{D(i) - \varepsilon_i}{(1+\varepsilon)D[i,n]} \le -y_i \le (1+4\varepsilon)\frac{D(i) + \varepsilon_i}{(1-\varepsilon)D[i,n]}$, $\forall i \in T$ (15)

$\sum_{i \in T} \varepsilon_i \le \varepsilon$ (16)

$0 \le \varepsilon_i \le 2\alpha$, $\forall i \in T$ (17)

$\ln\left(1 - \frac{D(i) + 2\alpha}{(1-\varepsilon)D[i,n]}\right) \le y_i \le \ln\left(1 - \frac{D(i) - 2\alpha}{(1+\varepsilon)D[i,n]}\right)$, $\forall i \in S$ (18)

Given a solution to the linear program above, define $P$ (a non-normalized probability distribution)
by setting $P(i) = (1 - e^{y_i})\, e^{\sum_{j=1}^{i-1} y_j}$ for $i \in \{1,\dots,b\}$, and $P(i) = 0$ for $i \in I = \{b+1,\dots,n\}$.
A MHR distribution is then obtained by normalizing $P$.


Completeness. Suppose $P \in \mathcal{MHR}$ is as promised. In particular, by the Kolmogorov distance
assumption we know that every $i \in T$ has $P(i) \le \varepsilon^2 + 2\alpha \le 2\varepsilon^2$.

• For any $i \in T$, we have that $\frac{P(i)}{P[i,n]} \le 4\varepsilon$, and

$\frac{D(i) - \varepsilon_i}{(1+\varepsilon)D[i,n]} \le \frac{P(i)}{P[i,n]} \le -\ln\left(1 - \frac{P(i)}{P[i,n]}\right) \le (1+4\varepsilon)\frac{P(i)}{P[i,n]} \le \frac{1+4\varepsilon}{1-\varepsilon}\cdot\frac{D(i) + \varepsilon_i}{D[i,n]}$ (19)

where we used Equation 11 for the two outer inequalities; and so (15), (16), and (17) would
follow from setting $\varepsilon_i := |P(i) - D(i)|$ (along with the guarantees on $\ell_1$ and Kolmogorov
distances between $P$ and $D$).

• For $i \in S$, Constraint (18) is also met, as $\left[\frac{D(i) - 2\alpha}{P([i,n])}, \frac{D(i) + 2\alpha}{P([i,n])}\right] \subseteq \left[\frac{D(i) - 2\alpha}{(1+\varepsilon)D([i,n])}, \frac{D(i) + 2\alpha}{(1-\varepsilon)D([i,n])}\right]$.

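Soundness. Assume a feasible solution to the linear program is found. We argue that this implies
$D$ is $O(\varepsilon)$-close to some MHR distribution, namely to the distribution obtained by renormalizing
$P$.

In order to do so, we bound separately the $\ell_1$ distance between $D$ and $P$, from $I$, $S$, and $T$.
First, $\sum_{i \in I} |D(i) - P(i)| = \sum_{i \in I} D(i) \le \frac{\varepsilon}{2}$ by construction. For $i \in T$, the ratio $D(i)/D[i,n]$ is $O(\varepsilon)$
(as $D(i) \le \varepsilon^2$ while $D[i,n] > \varepsilon/2$), and

$P(i) = (1 - e^{y_i})\, e^{\sum_{j=1}^{i-1} y_j} \in [1 \pm \varepsilon]\,(1 - e^{y_i})\, D([i,n])$.

Now,

$1 - (1-\varepsilon)\frac{D(i) - \varepsilon_i}{(1+\varepsilon)D[i,n]} \ge e^{-\frac{D(i) - \varepsilon_i}{(1+\varepsilon)D[i,n]}} \ge e^{y_i} \ge e^{-(1+4\varepsilon)\frac{D(i) + \varepsilon_i}{(1-\varepsilon)D[i,n]}} \ge 1 - (1+4\varepsilon)\frac{D(i) + \varepsilon_i}{(1-\varepsilon)D[i,n]}$

so that

$(1-\varepsilon)\frac{D(i) - \varepsilon_i}{(1+\varepsilon)D[i,n]} \le 1 - e^{y_i} \le (1+4\varepsilon)\frac{D(i) + \varepsilon_i}{(1-\varepsilon)D[i,n]}$

which implies

$(1 - 10\varepsilon)(D(i) - \varepsilon_i) \le P(i) \le (1 + 10\varepsilon)(D(i) + \varepsilon_i)$

so that $\sum_{i \in T} |D(i) - P(i)| \le 10\varepsilon \sum_{i \in T} D(i) + (1 + 10\varepsilon)\sum_{i \in T} \varepsilon_i \le 10\varepsilon + (1 + 10\varepsilon)\varepsilon \le 20\varepsilon$ where the
last inequality follows from Constraint (16).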

To analyze the contribution from $S$, we observe that Constraint (18) implies that, for any $i \in S$,

$\frac{D(i) - 2\alpha}{(1+\varepsilon)D([i,n])} \le \frac{P(i)}{P([i,n])} \le \frac{D(i) + 2\alpha}{(1-\varepsilon)D([i,n])}$

which combined with Constraint (14) guarantees

$\frac{D(i) - 2\alpha}{(1+\varepsilon)^2 P([i,n])} \le \frac{P(i)}{P([i,n])} \le \frac{D(i) + 2\alpha}{(1-\varepsilon)^2 P([i,n])}$

which in turn implies that $|P(i) - D(i)| \le 3\varepsilon P(i) + 2\alpha$. Recalling that $|S| \le \frac{1}{\varepsilon^2}$ and $\alpha = \varepsilon^3$,
this yields $\sum_{i \in S} |D(i) - P(i)| \le 3\varepsilon \sum_{i \in S} P(i) + 2\varepsilon \le 3\varepsilon(1+\varepsilon) + 2\varepsilon \le 8\varepsilon$. Summing up, we
get $\sum_{i=1}^{n} |D(i) - P(i)| \le 30\varepsilon$ which finally implies by the triangle inequality that the $\ell_1$ distance
between $D$ and the normalized version of $P$ (a valid MHR distribution) is at most $32\varepsilon$.


Running time. The running time is immediate, from executing the two linear programs on 
poly(n, 1 /e) variables and constraints. □ 

C.3 Proof of Lemma 4.15 

Lemma 4.15 (Log-concavity). There exists a procedure ProjectionDist$_{\mathcal{L}}$ that, on input $n$ as
well as the full specifications of a $k$-histogram distribution $D$ on $[n]$ and an $\ell$-histogram distribution
$D'$ on $[n]$, runs in time poly($n$, $k$, $\ell$, $1/\varepsilon$), and satisfies the following.

• If there is $P \in \mathcal{L}$ such that $\|D-P\|_1 \le \varepsilon$ and $\|D'-P\|_{\mathrm{Kol}} \le \frac{\varepsilon^2}{\log^2(1/\varepsilon)}$, then the procedure
returns yes;

• If $\ell_1(D, \mathcal{L}) \ge 100\varepsilon$, then the procedure returns no.

Proof. We set a d = log / ( 2 1/e) , P = f log fi/ e) , and 7 = f (so that a < < 7 < e), 

Given the explicit description of a distribution $D$ on $[n]$, which is a $k$-histogram over a partition
$\mathcal{I} = (I_1, \dots, I_k)$ of $[n]$ with $k$ = poly($\log n$, $1/\varepsilon$), and the explicit description of a distribution $D'$ on
$[n]$, one must efficiently distinguish between:

(a) $D$ is $\varepsilon$-close to a log-concave $P$ (in $\ell_1$ distance) and $D'$ is $\alpha$-close to $P$ (in Kolmogorov
distance); and

(b) $D$ is $100\varepsilon$-far from log-concave.

If we are willing to pay an extra factor of O(n), we can assume without loss of generality that we 
know the mode of the closest log-concave distribution (which is implicitly assumed in the following: 
the final algorithm will simply try all possible modes). 

Outline. First, we argue that we can simplify to the case where $D$ is unimodal. Then, we reduce
to the case where $D$ and $D'$ are only one distribution, satisfying both requirements from
the completeness case. Both can be done efficiently (Section C.3.1), and make the rest much
easier. Then, we perform some ad hoc partitioning of $[n]$, using our knowledge of $D$, into $O(1/\varepsilon^2)$
pieces such that each piece is either a “heavy” singleton, or an interval $I$ with weight very close
(multiplicatively) to $D(I)$ under the target log-concave distribution, if it exists (Section C.3.2). This
in particular simplifies the type of log-concave distribution we are looking for: it is sufficient to look
for distributions putting that very specific weight on each piece, up to a $(1 + o(1))$ factor. Then, in
Section C.3.3, we write and solve a linear program to try and find such a “simplified” log-concave
distribution, and reject if no feasible solution exists.

Note that the first two sections allow us to argue that instead of additive (in $\ell_1$) closeness,
we can enforce constraints on multiplicative (within a $(1 + \varepsilon)$ factor) closeness between $D$ and the
target log-concave distribution. This is what enables a linear program with variables being the
logarithm of the probabilities, which plays very nicely with the log-concavity constraints.
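
To see why logarithmic variables are convenient, recall that a positive sequence $p$ is log-concave when $p_i^2 \ge p_{i-1}p_{i+1}$, which in the variables $x_i = \ln p_i$ is the linear inequality $x_i - x_{i-1} \ge x_{i+1} - x_i$. The following small sketch (ours; the helper name and the example sequences are illustrative) checks this linear form directly.

```python
import math


def is_log_concave(p, tol=1e-12):
    """Check x_i - x_{i-1} >= x_{i+1} - x_i with x_i = ln p_i, for strictly positive p."""
    x = [math.log(v) for v in p]
    return all(x[i] - x[i - 1] >= x[i + 1] - x[i] - tol for i in range(1, len(x) - 1))


n = 10
binomial = [math.comb(n, k) / 2**n for k in range(n + 1)]   # log-concave
bimodal = [0.4, 0.1, 0.1, 0.4]                              # not log-concave

print(is_log_concave(binomial), is_log_concave(bimodal))
```

Each inequality involves only three consecutive variables, so log-concavity on $[n]$ contributes $O(n)$ linear constraints.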

We will require the following result of Chan, Diakonikolas, Servedio, and Sun: 













Theorem C.6 ([CDSS13, Lemma 4.1]). Let $D$ be a distribution over $[n]$, log-concave and non-decreasing
over $\{1,\dots,b\} \subseteq [n]$. Let $a \le b$ such that $\sigma = D(\{1,\dots,a-1\}) > 0$, and write
$\tau = D(\{a,\dots,b\})$. Then $\frac{D(b)}{D(a)} \le 1 + \frac{\tau}{\sigma}$.
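
As a sanity check of the inequality $D(b)/D(a) \le 1 + \tau/\sigma$, the sketch below verifies it exhaustively on two non-decreasing log-concave examples (a linear and a geometric profile). The examples and the helper name are ours, and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction


def ratio_bound_holds(D, a, b):
    """Check D(b)/D(a) <= 1 + tau/sigma, with sigma = D({1..a-1}) and tau = D({a..b}).

    D is a list whose index 0 corresponds to the point 1; a and b are 1-indexed, a <= b.
    """
    sigma = sum(D[: a - 1])
    tau = sum(D[a - 1 : b])
    return sigma > 0 and D[b - 1] / D[a - 1] <= 1 + tau / sigma


n = 12
# D(i) proportional to i: non-decreasing and log-concave on {1, ..., n}.
linear = [Fraction(i, n * (n + 1) // 2) for i in range(1, n + 1)]
# D(i) proportional to (3/2)^i: non-decreasing and log-concave (log-linear).
raw = [Fraction(3, 2) ** i for i in range(1, n + 1)]
geometric = [v / sum(raw) for v in raw]

all_ok = all(
    ratio_bound_holds(D, a, b)
    for D in (linear, geometric)
    for a in range(2, n + 1)
    for b in range(a, n + 1)
)
print(all_ok)
```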

C.3.1 Step 1 

Reducing to $D$ unimodal. Using a linear program, find a closest unimodal distribution $\tilde{D}$ to
$D$ (also a $k$-histogram on $\mathcal{I}$) under the constraint that $\|\tilde{D} - D'\|_{\mathrm{Kol}} \le \alpha$: this can be done in time
poly($k$). If $\|D - \tilde{D}\|_1 > \varepsilon$, output REJECT.

• If $D$ is $\varepsilon$-close to a log-concave distribution $P$ as above, then it is in particular $\varepsilon$-close to
unimodal and we do not reject. Moreover, by the triangle inequality $\|\tilde{D} - P\|_1 \le 2\varepsilon$ and
$\|\tilde{D} - P\|_{\mathrm{Kol}} \le 2\alpha$.

• If $D$ is $100\varepsilon$-far from log-concave and we do not reject, then $\ell_1(\tilde{D}, \mathcal{L}) \ge 99\varepsilon$.

Reducing to $D = D'$. First, we note that it is easy to reduce our problem to the case where, in
the completeness case, we have $P \in \mathcal{L}$ such that $\|D-P\|_1 \le 4\varepsilon$ and $\|D-P\|_{\mathrm{Kol}} \le 4\alpha$; while in
the soundness case $\ell_1(D, \mathcal{L}) \ge 97\varepsilon$. Indeed, this can be done with a linear program on poly($k$, $\ell$)
variables and constraints, asking to find a $(k+\ell)$-histogram $D''$ on a refinement of $D$ and $D'$
minimizing the $\ell_1$ distance to $D$, under the constraint that the Kolmogorov distance to $D'$ be
bounded by $2\alpha$. (In the completeness case, clearly a feasible solution exists, as (the flattening on
this $(k+\ell)$-interval partition of) $P$ is one.) We therefore follow with this new formulation: either

(a) $D$ is $4\varepsilon$-close to a log-concave $P$ (in $\ell_1$ distance) and $D$ is $4\alpha$-close to $P$ (in Kolmogorov
distance); or

(b) $D$ is $97\varepsilon$-far from log-concave;
where $D$ is a $(k+\ell)$-histogram.

This way, we have reduced the problem to a slightly more convenient one, that of Section C.3.2.

Reducing to knowing the support $[a,b]$. The next step is to compute a good approximation
of the support of any target log-concave distribution. This is easily obtained in time $O(k)$ as the
interval $\{a,\dots,b\}$ such that

• $D(\{1,\dots,a-1\}) \le \alpha$ but $D(\{1,\dots,a\}) > \alpha$; and

• $D(\{b+1,\dots,n\}) \le \alpha$ but $D(\{b,\dots,n\}) > \alpha$.

Any log-concave distribution that is $\alpha$-close to $D$ must include $\{a,\dots,b\}$ in its support, since
otherwise the $\ell_1$ distance between $D$ and $P$ is already greater than $\alpha$. Conversely, if $P$ is a
log-concave distribution $\alpha$-close to $D$, it is easy to see that the distribution obtained by setting $P$ to
be zero outside $\{a,\dots,b\}$ and renormalizing the result is still log-concave, and $O(\alpha)$-close to $D$.
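
The two conditions defining $a$ and $b$ amount to trimming prefixes and suffixes of total weight at most $\alpha$. A minimal sketch follows; it is ours, it scans the explicit probability vector in $O(n)$ time rather than exploiting the histogram structure to reach $O(k)$, and the function name and example are illustrative.

```python
def approximate_support(D, alpha):
    """Return 1-indexed (a, b) such that
    D({1..a-1}) <= alpha < D({1..a})  and  D({b+1..n}) <= alpha < D({b..n}).

    D is a list of probabilities whose index 0 corresponds to the point 1.
    """
    n = len(D)
    prefix, a = 0.0, 1
    while a <= n and prefix + D[a - 1] <= alpha:   # extend the light prefix while it stays <= alpha
        prefix += D[a - 1]
        a += 1
    suffix, b = 0.0, n
    while b >= 1 and suffix + D[b - 1] <= alpha:   # same from the right
        suffix += D[b - 1]
        b -= 1
    return a, b


D = [0.01, 0.02, 0.30, 0.40, 0.25, 0.02]
a, b = approximate_support(D, alpha=0.05)
print(a, b)
```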

C.3.2 Step 2 

Given the explicit description of a unimodal distribution $D$ on $[n]$, which is a $k$-histogram over a
partition $\mathcal{I} = (I_1, \dots, I_k)$ of $[n]$ with $k$ = poly($\log n$, $1/\varepsilon$), one must efficiently distinguish between:
(a) $D$ is $\varepsilon$-close to a log-concave $P$ (in $\ell_1$ distance) and $\alpha$-close to $P$ (in Kolmogorov distance);
and





(b) D is 24e-far from log-concave, 

assuming we know the mode of the closest log-concave distribution, which has support [n]. 

In this stage, we compute a partition $\mathcal{J}$ of $[n]$ into $O(1/\varepsilon^2)$ intervals (here, we implicitly use
the knowledge of the mode of the closest log-concave distribution, in order to apply Theorem C.6
differently on two intervals of the support, corresponding to the non-decreasing and non-increasing
parts of the target log-concave distribution).

As $D$ is unimodal, we can efficiently ($O(\log k)$) find the interval $S$ of heavy points, that is

$S := \{ x \in [n] : D(x) \ge \beta \}$.

Each point in $S$ will form a singleton interval in our partition. Let $T := [n] \setminus S$ be its complement
($T$ is the union of at most two intervals $T_1, T_2$ on which $D$ is monotone, the head and tail of
the distribution). For convenience, we focus on only one of these two intervals, without loss of
generality the “head” $T_1$ (on which $D$ is non-decreasing).

1. Greedily find J = {1, the smallest prefix of the distribution satisfying D(J) € 

["hi Tij] • 

2. Similarly, partition Ti \ J into intervals (with s = 0 ( 1 / 7 ) = 0( 1/e 2 )) such that 

^ < D(Ij ) < for all 1 < j < s — 1, and ^ < D{I ' S ) < 7 . This is possible as all points 
not in S have weight less than /3, and /3 -C 7 . 


Discussion: why doing this? We focus on the completeness case: let $P \in \mathcal{L}$ be a log-concave
distribution such that $\|D-P\|_1 \le \varepsilon$ and $\|D-P\|_{\mathrm{Kol}} \le \alpha$. Applying Theorem C.6 on $J$ and the
$I'_j$'s, we obtain (using the fact that $|P(I'_j) - D(I'_j)| \le 2\alpha$) that:


max x6/; . P(x) D(E)+2a 7 + 2a 

- l — <i_|- 111 -< 1 + — - 

min xeI > P(x) D{J) — 2a — 2a 


1 + £ + O 


log ( 1 /e) 


def 1 

= 1 T ft. 


Moreover, we also get that each resulting interval $I'_j$ will satisfy

$D(I'_j)(1 - \kappa_j) = D(I'_j) - 2\alpha \le P(I'_j) \le D(I'_j) + 2\alpha = D(I'_j)(1 + \kappa_j)$

with $\kappa_j := \frac{2\alpha}{D(I'_j)} = \Theta(1/\log^2(1/\varepsilon))$.

Summing up, we have a partition of [n] into |5| + 2 = 0(l/e 2 ) intervals such that: 

• The (at most) two end intervals have D{ J) € [^j — /3, , and thus P( J) € — P — 2a, ^ + 2a] 

• the 0(l/e 2 ) singleton-intervals from S are points x with D(x) > j3, so that P(x) > /3—2a > 

• each other interval $I = I'_j$ satisfies

$(1 - \kappa_j)D(I) \le P(I) \le (1 + \kappa_j)D(I)$ (20)

with $\kappa_j = O(1/\log^2(1/\varepsilon))$; and


rnax^gj P(x) 
min xe j P(x) 


< 1 + K < 1 + ~£. 


( 21 ) 


We will use in the constraints of the linear program the fact that (1 + |e)(l + Kj) < 1 + 2e, and 

l ~ K 3 > 1 

l+|i - l+2e- 
C.3.3 Step 3

We start by computing the partition $\mathcal{J} = (J_1, \dots, J_\ell)$ as in Section C.3.2; with $\ell = O(1/\varepsilon^2)$; and
write $J_j = \{a_j, \dots, b_j\}$ for all $j \in [\ell]$. We further denote by $S$ and $T$ the set of heavy and light
points, following the notations from Section C.3.2; and let $T' := T'_1 \cup T'_2$ be the set obtained by
removing the two “end intervals” (called $J$ in the previous section) from $T$.


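Algorithm 7 Linear Program

Find $x_1, \dots, x_n, \varepsilon_1, \dots, \varepsilon_{|S|}$

s.t.

$x_i \le 0$ (22)

$x_i - x_{i-1} \ge x_{i+1} - x_i$, $\forall i \in [n]$ (23)

$-\ln(1 + 2\varepsilon) \le x_i - \mu_j \le \ln(1 + 2\varepsilon)$, $\forall j \in T', \forall i \in J_j$ (24)

$-2\frac{\varepsilon_i}{D(i)} \le x_i - \ln D(i) \le \frac{\varepsilon_i}{D(i)}$, $\forall i \in S$ (25)

$\sum_{i \in S} \varepsilon_i \le \varepsilon$ (26)

$0 \le \varepsilon_i \le 2\alpha$, $\forall i \in S$ (27)

where $\mu_j := \ln\frac{D(J_j)}{|J_j|}$ for $j \in T'$. (28)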


Lemma C.7 (Soundness). If the linear program (Algorithm 7) has a feasible solution, then $\ell_1(D, \mathcal{L}) \le O(\varepsilon)$.

Proof. A feasible solution to this linear program will define (setting $p_i = e^{x_i}$) a sequence
$p = (p_1, \dots, p_n) \in (0,1]^n$ such that

• $p$ takes values in $(0,1]$ (from (22));

• $p$ is log-concave (from (23));

• $p$ is “$(1 + O(\varepsilon))$-multiplicatively constant” on each interval $J_j$ (from (24));

• $p$ puts roughly the right amount of weight on each $J_j$:

  - weight $(1 \pm O(\varepsilon))D(J)$ on every $J$ from $T'$ (from (24)), so that the $\ell_1$ distance between
$D$ and $p$ coming from $T'$ is at most $O(\varepsilon)$;

  - it puts weight approximately $D(J)$ on every singleton $J$ from $S$, i.e. such that $D(J) \ge \beta$.
To see why, observe that each $\varepsilon_i$ is in $[0, 2\alpha]$ by constraints (27). In particular, this means
that $\frac{\varepsilon_i}{D(i)} \le \frac{2\alpha}{\beta} \ll 1$, and we have

$D(i) - 4\varepsilon_i \le D(i)\cdot e^{-2\varepsilon_i/D(i)} \le p_i = e^{x_i} \le D(i)\cdot e^{\varepsilon_i/D(i)} \le D(i) + 4\varepsilon_i$

and together with (26) this guarantees that the $\ell_1$ distance between $D$ and $p$ coming
from $S$ is at most $4\varepsilon$.










Note that the solution obtained this way may not sum to one - i.e., is not necessarily a probability
distribution. However, it is easy to renormalize $p$ to obtain a bona fide probability distribution $P$
as follows: set $P(i) = \frac{p(i)}{\sum_{i \in S \cup T'} p(i)}$ for all $i \in S \cup T'$, and $P(i) = 0$ for $i \in T \setminus T'$.

Since by the above discussion we know that plS U T') is within 0(e) of D(S U T') (itself in 
[ 1 — 1 +^] by construction of T'), P is a log-concave distribution such that ||P — = 0(e). □ 

Lemma C.8 (Completeness). If there is $P$ in $\mathcal{L}$ such that $\|D-P\|_1 \le \varepsilon$ and $\|D-P\|_{\mathrm{Kol}} \le \alpha$,
then the linear program (Algorithm 7) has a feasible solution.

Proof. Let $P \in \mathcal{L}$ such that $\|D-P\|_1 \le \varepsilon$ and $\|D-P\|_{\mathrm{Kol}} \le \alpha$. Define $x_i := \ln P(i)$ for all $i \in [n]$.
Constraints (22) and (23) are immediately satisfied, since $P$ is log-concave. By the discussion
from Section C.3.2 (more specifically, Eq. (20) and (21)), constraint (24) holds as well.

Letting $\varepsilon_i := |P(i) - D(i)|$ for $i \in S$, we also immediately have (26) and (27) (since $\|P-D\|_1 \le \varepsilon$
and $\|D-P\|_{\mathrm{Kol}} \le \alpha$ by assumption). Finally, to see why (25) is satisfied, we rewrite

$x_i - \ln D(i) = \ln\frac{P(i)}{D(i)} = \ln\frac{D(i) \pm \varepsilon_i}{D(i)} = \ln\left(1 \pm \frac{\varepsilon_i}{D(i)}\right)$

and use the fact that $\ln(1+x) \le x$ and $\ln(1-x) \ge -2x$ (the latter for small enough $x$, along with
$\frac{\varepsilon_i}{D(i)} \le \frac{2\alpha}{\beta} \ll 1$).

□


C.3.4 Putting it all together: Proof of Lemma 4.15

The algorithm is as follows (keeping the notations from Section C.3.1 to Section C.3.3):

• Set $\alpha$, $\beta$, $\gamma$ as above.

• Follow Section C.3.1 to reduce it to the case where $D$ is unimodal and satisfies the conditions
for Kolmogorov and $\ell_1$ distance; and a good $[a,b]$ approximation of the support is known.

• For each of the $O(n)$ possible modes $c \in [a,b]$:

  - Run the linear program Algorithm 7, return ACCEPT if a feasible solution is found.

• None of the linear programs was feasible: return REJECT.

The correctness comes from Lemma C.7 and Lemma C.8 and the discussions in Section C.3.1
to Section C.3.3; as for the claimed running time, it is immediate from the algorithm and the fact
that the linear program executed at each step has poly($n$, $1/\varepsilon$) constraints and variables.

□ 
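
The outer loop of the procedure is simple enough to sketch. In the snippet below (ours), `lp_is_feasible` is a hypothetical callback standing in for "build and solve Algorithm 7 for this candidate mode"; the reductions of Section C.3.1 and the LP itself are not implemented here.

```python
from typing import Callable


def project_log_concave(a: int, b: int, lp_is_feasible: Callable[[int], bool]) -> str:
    """Try every candidate mode c in {a, ..., b}; accept as soon as one LP is feasible."""
    for mode in range(a, b + 1):
        if lp_is_feasible(mode):     # hypothetical: solves the LP of Algorithm 7 for this mode
            return "ACCEPT"
    return "REJECT"                  # none of the linear programs was feasible


# Toy usage with stand-in predicates instead of real LP solves.
print(project_log_concave(3, 7, lambda c: c == 5))
print(project_log_concave(3, 7, lambda c: False))
```

Trying all modes multiplies the running time by at most $b - a + 1 \le n$, which is the extra $O(n)$ factor mentioned at the start of the proof.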


D Proof of Theorem 6.3 

In this section, we establish our lower bound for tolerant testing of the Binomial distribution, 
restated below: 

Theorem 6.3. There exists an absolute constant $\varepsilon_0 > 0$ such that the following holds. Any
algorithm which, given sampling access to an unknown distribution $D$ on $\Omega$ and parameter
$\varepsilon \in (0, \varepsilon_0)$, distinguishes with probability at least 2/3 between (i) $\|D - \mathrm{Bin}(n, 1/2)\|_1 \le \varepsilon$ and (ii)
||D — Bin(n, 1/2) || x > lOOe must use samples. 
The theorem will be a consequence of the (slightly) more general result below:

Theorem D.1. There exist absolute constants $\varepsilon_0 > 0$ and $\lambda > 0$ such that the following holds.
Any algorithm which, given SAMP access to an unknown distribution $D$ on $\Omega$ and parameter
$\varepsilon \in (0, \varepsilon_0)$, distinguishes with probability at least 2/3 between (i) $\|D - \mathrm{Bin}(n, \frac{1}{2})\|_1 \le \varepsilon$ and (ii)
$\|D - \mathrm{Bin}(n, \frac{1}{2})\|_1 \ge \lambda\varepsilon^{1/3} - \varepsilon$ must use $\Omega\left(\varepsilon\frac{\sqrt{n}}{\log(\varepsilon^{2/3} n)}\right)$ samples.

By choosing a suitable $\varepsilon$ and working out the corresponding parameters, this for instance enables
us to derive the following:

Corollary D.2. There exists an absolute constant $\varepsilon_0 \in (0, 1/1000)$ such that the following holds.
Any algorithm which, given SAMP access to an unknown distribution $D$ on $\Omega$, distinguishes with
probability at least 2/3 between (i) $\|D - \mathrm{Bin}(n, \frac{1}{2})\|_1 \le \varepsilon_0$ and (ii) $\|D - \mathrm{Bin}(n, \frac{1}{2})\|_1 \ge 100\varepsilon_0$ must
use $\Omega\left(\frac{\sqrt{n}}{\log n}\right)$ samples.

By standard techniques, this will in turn imply Theorem 6.3.

Proof of Theorem D.1. Hereafter, we write for convenience $B_n := \mathrm{Bin}(n, \frac{1}{2})$. To prove this lower
bound, we will rely on the following:

Theorem D.3 ([VV10, Theorem 1]). For any constant $\phi \in (0, 1/4)$, the following holds. Any algorithm
which, given SAMP access to an unknown distribution $D$ on $\Omega$, distinguishes with probability at least
2/3 between (i) $\|D - \mathcal{U}_n\|_1 \le \phi$ and (ii) $\|D - \mathcal{U}_n\|_1 \ge \frac{1}{2} - \phi$, must have sample complexity at least
$\frac{\phi}{32}\cdot\frac{n}{\log n}$.

Without loss of generality, assume $n$ is even (so that $B_n$ has only one mode located at $\frac{n}{2}$). For
$c > 0$, we write $I_{n,c}$ for the interval $\{\frac{n}{2} - c\sqrt{n}, \dots, \frac{n}{2} + c\sqrt{n}\}$ and $J_{n,c} := \Omega \setminus I_{n,c}$.

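Fact D.4. For any $c > 0$,

$\frac{B_n(\frac{n}{2} + c\sqrt{n})}{B_n(n/2)}, \; \frac{B_n(\frac{n}{2} - c\sqrt{n})}{B_n(n/2)} \xrightarrow[n \to \infty]{} e^{-2c^2}$

and

$B_n(I_{n,c}) \in (1 \pm o(1)) \cdot [e^{-2c^2}, 1] \cdot 2c\sqrt{\frac{2}{\pi}} = \Theta(c)$.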
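
The first limit in Fact D.4 is easy to observe numerically. The sketch below (ours; the function name and the choice $n = 10^6$, $c = 1/2$ are illustrative) writes the ratio as a product of $m = c\sqrt{n}$ factors, which avoids computing huge binomial coefficients.

```python
import math


def binomial_ratio(n, m):
    """B_n(n/2 + m) / B_n(n/2) for even n, computed as prod_{j=1..m} (n/2 - j + 1) / (n/2 + j)."""
    h = n // 2
    log_ratio = sum(math.log((h - j + 1) / (h + j)) for j in range(1, m + 1))
    return math.exp(log_ratio)


n, c = 10**6, 0.5
m = int(c * math.sqrt(n))          # offset c * sqrt(n) from the mode
ratio = binomial_ratio(n, m)
limit = math.exp(-2 * c * c)       # the limiting value e^{-2 c^2}
print(ratio, limit)
```

The product form follows from $\binom{n}{h+m}/\binom{n}{h} = \frac{h!\,h!}{(h+m)!\,(h-m)!}$ with $h = n/2$.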

The reduction proceeds as follows: given sampling access to $D$ on $[n]$, we can simulate sampling
access to a distribution $D'$ on $[N]$ (where $N = \Theta(n^2)$) such that

• if $\|D - \mathcal{U}_n\|_1 \le \phi$, then $\|D' - B_N\|_1 < \varepsilon$;

• if $\|D - \mathcal{U}_n\|_1 \ge \frac{1}{2} - \phi$, then $\|D' - B_N\|_1 > \varepsilon' - \varepsilon$

for $\varepsilon := \Theta(\phi^{3/2})$ and $\varepsilon' := \Theta(\phi^{1/2})$; in a way that preserves the sample complexity.

More precisely, define $c := \sqrt{\frac{1}{2}\ln\frac{1}{1-\phi}} = \Theta(\sqrt{\phi})$ (so that $\phi = 1 - e^{-2c^2}$) and $N$ such that
$|I_{N,c}| = n$ (that is, $N = (n/(2c))^2 = \Theta(n^2/\phi)$). From now on, we can therefore identify $[n]$ to $I_{N,c}$
in the obvious way, and see a draw from $D$ as an element in $I_{N,c}$.

Let $p := B_N(I_{N,c}) = \Theta(\sqrt{\phi})$, and $B_{N,c}$, $\bar{B}_{N,c}$ respectively denote the conditional distributions
induced by $B_N$ on $I_{N,c}$ and $J_{N,c}$. Intuitively, we want $D$ to be mapped to the conditional distribution
of $D'$ on $I_{N,c}$, and the conditional distribution of $D'$ on $J_{N,c}$ to be exactly $\bar{B}_{N,c}$. This is done by
defining $D'$ by the process below:
• with probability $p$, we draw a sample from $D$ (seen as an element of $I_{N,c}$);

• with probability $1 - p$, we draw a sample from $\bar{B}_{N,c}$.

Let $\tilde{B}_N$ be defined as the distribution which exactly matches $B_N$ on $J_{N,c}$, but is uniform on $I_{N,c}$:

$\tilde{B}_N(i) = \frac{p}{|I_{N,c}|}$ if $i \in I_{N,c}$, and $\tilde{B}_N(i) = B_N(i)$ if $i \in J_{N,c}$.

From the above, we have that $\|D' - \tilde{B}_N\|_1 = p \cdot \|D - \mathcal{U}_n\|_1$. Furthermore, by Fact D.4, Lemma 2.8
and the definition of $I_{N,c}$, we get that $\|B_N - \tilde{B}_N\|_1 = p \cdot \|(B_N)_{I_{N,c}} - \mathcal{U}_{I_{N,c}}\|_1 \le p \cdot \phi$. Putting it all
together,

• If $\|D - \mathcal{U}_n\|_1 \le \phi$, then by the triangle inequality $\|D' - B_N\|_1 \le p(\phi + \phi) = 2p\phi$;

• If $\|D - \mathcal{U}_n\|_1 \ge \frac{1}{2} - \phi$, then similarly $\|D' - B_N\|_1 \ge p(\frac{1}{2} - \phi - \phi) = \frac{p}{2} - 2p\phi$.

Recalling that $p = \Theta(\sqrt{\phi})$ and setting $\varepsilon := 2p\phi$ concludes the reduction. From Theorem D.3, we
conclude that

$\frac{\phi n}{32 \log n} = \Omega\left(\frac{\phi^{3/2}\sqrt{N}}{\log(\phi N)}\right) = \Omega\left(\frac{\varepsilon\sqrt{N}}{\log(\varepsilon^{2/3} N)}\right)$

samples are necessary.

□ 
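
The simulation of $D'$ from samples of $D$ can be sketched as follows. This is only an illustration under our own conventions: points of $[N]$ are represented as $\{0, \dots, N\}$, the central interval is taken to be $\{N/2 - \lfloor c\sqrt{N} \rfloor, \dots, N/2 + \lfloor c\sqrt{N} \rfloor\}$, the conditional binomial outside that interval is sampled by rejection, and all names are hypothetical.

```python
import math
import random


def central_interval(N, c):
    """Endpoints (lo, hi) of the central interval around N/2 with half-width floor(c * sqrt(N))."""
    half, width = N // 2, int(c * math.sqrt(N))
    return half - width, half + width


def make_sampler(draw_from_D, N, c, rng):
    """Return (draw, p): draw() samples D', and p is the Bin(N, 1/2) mass of the central interval."""
    lo, hi = central_interval(N, c)
    p = sum(math.comb(N, k) for k in range(lo, hi + 1)) / 2**N

    def draw():
        if rng.random() < p:
            return lo + draw_from_D()            # a draw from D, seen as an element of the interval
        while True:                              # rejection sampling: Bin(N, 1/2) outside the interval
            k = bin(rng.getrandbits(N)).count("1")
            if k < lo or k > hi:
                return k

    return draw, p


rng = random.Random(0)
N, c = 400, 0.5
lo, hi = central_interval(N, c)                  # (190, 210)
n = hi - lo + 1                                  # size of the domain of D
draw, p = make_sampler(lambda: rng.randrange(n), N, c, rng)   # here D is uniform on {0, ..., n-1}
samples = [draw() for _ in range(4000)]
frac_inside = sum(lo <= s <= hi for s in samples) / len(samples)
print(p, frac_inside)
```

Each sample of $D'$ uses at most one sample of $D$, which is the sense in which the reduction preserves the sample complexity.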

Proof of Corollary D.2. The corollary follows from the proof of Theorem D.1, by taking $\phi =
1/1000$ and computing the corresponding $\varepsilon$ and $\varepsilon' - \varepsilon$ to check that indeed $\lim_{n\to\infty} \frac{\varepsilon' - \varepsilon}{\varepsilon} \ge 100$. □